Abstract — Approximation of the optimal two-part MDL 
code for given data, through successive monotonically length- 
decreasing two-part MDL codes, has the following properties: (i) 
computation of each step may take arbitrarily long; (ii) we may 
not know when we reach the optimum, or whether we will reach 
the optimum at all; (iii) the sequence of models generated may 
not monotonically improve the goodness of fit; but (iv) the model 
associated with the optimum has (almost) the best goodness of fit. 
To express the practically interesting goodness of fit of individual 
models for individual data sets we have to rely on Kolmogorov 
complexity. 

Index Terms — minimum description length, model selection, 
MDL code, approximation, model fitness, Kolmogorov complex- 
ity, structure functions, examples 



I. Introduction 

In machine learning pure applications of MDL are rare, 
partially because of the difficulties one encounters trying to 
define an adequate model code and data-to-model code, and 
partially because of the operational difficulties that are poorly 
understood. We analyze aspects of both the power and the 
perils of MDL precisely and formally. Let us first resurrect a 
familiar problem from our childhood to illustrate some of the 
issues involved. 

The process of solving a jigsaw puzzle involves an incre- 
mental reduction of entropy, and this serves to illustrate the 
analogous features of the learning problems which are the 
main issues of this work. Initially, when the pieces come out of 
the box they have a completely random ordering. Gradually we 
combine pieces, thus reducing the entropy and increasing the 
order until the puzzle is solved. In this last stage we have found 
a maximal ordering. Suppose that Alice and Bob both start to 
solve two versions of the same puzzle, but that they follow 
different strategies. Initially, Alice sorts all pieces according 
to color, and Bob starts by sorting the pieces according to 
shape. (For the sake of argument we assume that the puzzle 
has no recognizable edge pieces.) The crucial insight, shared 
by experienced puzzle aficionados, is that Alice's strategy is 
efficient whereas Bob's strategy is not and is in fact even worse 
than a random strategy. Alice's strategy is efficient, since the 
probability that pieces with about the same color match is 
much greater than the unconditional probability of a match. 
On the other hand the information about the shape of the pieces 
can only be used in a relatively late stage of the puzzle process. 
Bob's effort in the beginning is a waste of time, because he 
must reorder the pieces before he can proceed to solve the 
puzzle. This example shows that if the solution of a problem 
depends on finding a maximal reduction of entropy this does 
not mean that every reduction of entropy brings us closer to 
the solution. Consequently reduction of entropy is not in all 
cases a good strategy. 

A. Entropy Versus Kolmogorov Complexity 

Above we use "entropy" in the often used, but inaccu- 
rate, sense of "measure of unorderedness of an individual 
arrangement." However, entropy is a measure of uncertainty 
associated with a random variable, here a set of arrangements 
each of which has a certain probability of occurring. The 
entropy of every individual arrangement is by definition zero. 
To circumvent this problem, often the notion of "empirical 
entropy" is used, where certain features like letter frequencies 
of the individual object are analyzed, and the entropy is taken 
with respect to the set of all objects having the same features. 
The result obviously depends on the choice of what features 
to use: no features gives maximal entropy and all features 
(determining the individual object uniquely) gives entropy zero 
again. Unless one has knowledge of the characteristics of a 
definite random variable producing the object as a typical outcome, 
this procedure gives arbitrary, and presumably meaningless, 
results. This conundrum arises since classical information 
theory deals with random variables and the communication of 
information. It does not deal with the information (and the 
complexity thereof) in an individual object independent of 
an existing (or nonexisting) random variable producing it. To 
capture the latter notion precisely one has to use "Kolmogorov 
complexity" instead of "entropy," and we will do so in our 
treatment. For now, the "Kolmogorov complexity" of a file is 
the number of bits in the ultimately compressed version of the 
file from which the original can still be losslessly extracted by 
a fixed general purpose decompression program. 

B. Learning by MDL 

Transferring the jigsaw puzzling insights to the general case 
of learning algorithms using the minimum description length 
principle (MDL), [10], [2], [11], we observe that although 
it may be true that the maximal compression yields the 
best solution, it may still not be true that every incremental 
compression brings us closer to the solution. Moreover, in the 
case of many MDL problems there is a complicating issue in 
the fact that the maximal compression cannot be computed. 

More formally, in constrained model selection the model is 
taken from a given model class. Using two-part MDL codes for 
the given data, we assume that the shortest two-part code for 
the data, consisting of the model code and the data-to-model 
code, yields the best model for the data. To obtain the shortest 
code, a natural way is to approximate it by a process of finding 
ever shorter candidate two-part codes. Since we start with a 
finite two-part code, and with every new candidate two-part 
code we decrease the code length, eventually we must achieve 
the shortest two-part code (assuming that we search through 
all two-part codes for the data). Unfortunately, there are two 
problems: (i) the computation to find the next shorter two- 
part code may be very long, and we may not know how long; 
and (ii) we may not know when we have reached the shortest 
two-part code: with each candidate two-part code there is the 
possibility that further computation may yield yet a shorter 
one. But because of item (i) we cannot a priori bound the 
length of that computation. There is also the possibility that the 
algorithm will never yield the shortest two-part code because 
it considers only part of the search space or gets trapped in a 
nonoptimal two-part code. 

C. Results 

We show that for some MDL algorithms the sequence of 
ever shorter two-part codes for the data converges in a finite 
number of steps to the best model. However, for every MDL 
algorithm the intermediate models may not converge monotonically 
in goodness. In fact, in the sequence of candidate 
two-part codes converging to a (globally or locally) shortest, it 
is possible that the models involved oscillate from being good 
to bad. Convergence is only monotone if the model-code parts 
in the successive two-part codes are always the shortest (most 
compressed) codes for the models involved. But this property 
cannot be guaranteed by any effective method. 

It is very difficult, if not impossible, to formalize the 
goodness of fit of an individual model for individual data in 
the classic statistics setting, which is probabilistic. Therefore, 
it is impossible to express the practically important issue 
above in those terms. Fortunately, new developments in the 
theory of Kolmogorov complexity [6], [15] make it possible to 
rigorously analyze the questions involved, possibly involving 
noncomputable quantities. But it is better to have a definite 
statement in a theory than having no definite statement at 
all. Moreover, for certain algorithms (like Algorithm Optimal 
MDL in Theorem 2) we can guarantee that they satisfy the 
conditions required, even though these are possibly noncomputable. 
In Section II we review the necessary notions from 
[15], both so that the paper is self-contained and so that the 
definitions and notations are extended from the previously used 
singleton data to multiple data samples. Theorem 1 shows 
that the use of MDL will be approximately invariant under 
recoding of the data. The next two sections contain the main 
results: Definition 4 defines the notion of an MDL algorithm. 
Theorem 2 shows that there exists such an MDL algorithm 
that in the (finite) limit results in an optimal model. The next 
statements are about MDL algorithms in general, including the 
ones that do not necessarily result in an optimal MDL code. 
Theorem 3 states a sufficient condition for improvement of 
the randomness deficiency (goodness of fit) of two consecutive 
length-decreasing MDL codes. This extends Lemma V.2 of 
[15] (which assumes all programs are shortest) and corrects 
the proof concerned. The theory is applied and illustrated 
in Section V: Theorem 4 shows by example that a minor 
violation of the sufficiency condition in Theorem 3 can result 
in worsening the randomness deficiency (goodness of fit) of 
two consecutive length-decreasing MDL codes. The special 
case of learning DFAs from positive examples is treated 
in Section VI. The main result shows, for a concrete and 
computable MDL code, that a decrease in the length of the 
two-part MDL code does not imply a better model fit (see 
Section VI-C) unless there is a sufficiently large decrease as 
that required in Theorem 3 (see Remark 12). 

II. Data and Model 

Let $x, y, z \in \mathcal{N}$, where $\mathcal{N}$ denotes the natural numbers and 
we identify $\mathcal{N}$ and $\{0,1\}^*$ according to the correspondence 

$$(0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots$$

Here $\epsilon$ denotes the empty word. The length $|x|$ of $x$ is the 
number of bits in the binary string $x$, not to be confused with 
the cardinality $|S|$ of a finite set $S$. For example, $|010| = 3$ and 
$|\epsilon| = 0$, while $|\{0,1\}^n| = 2^n$ and $|\emptyset| = 0$. Below we will use 
the natural numbers and the binary strings interchangeably. 
Definitions, notations, and facts we use about prefix codes, 
self-delimiting codes, and Kolmogorov complexity, can be 
found in [9] and are briefly reviewed in Appendix A. 
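The correspondence between natural numbers and binary strings can be checked mechanically. The following Python sketch is only an illustration; the function names `nat_to_str` and `str_to_nat` are ours and do not appear in the paper. It relies on the observation that the string paired with $i$ is the binary expansion of $i + 1$ with its leading 1 removed.

```python
def nat_to_str(i: int) -> str:
    """Map a natural number to its binary string: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    if i < 0:
        raise ValueError("natural numbers only")
    # bin(i + 1) looks like '0b1...'; dropping '0b' and the leading '1' leaves the string.
    return bin(i + 1)[3:]


def str_to_nat(s: str) -> int:
    """Inverse map: put the leading '1' back, read as binary, subtract one."""
    if any(c not in "01" for c in s):
        raise ValueError("binary strings only")
    return int("1" + s, 2) - 1


first_five = [nat_to_str(i) for i in range(5)]  # the pairs listed in the text
```

Under this identification the length $|x|$ of the string paired with $i$ is $\lfloor \log_2 (i + 1) \rfloor$.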

The emphasis is on binary sequences only for convenience; 
observations in any alphabet can be encoded in binary in a 
way that is theory neutral. Therefore, we consider only data 
$x$ in $\{0,1\}^*$. In a typical statistical inference situation we are 
given a subset of $\{0,1\}^*$, the data sample, and are required to 
infer a model for the data sample. Instead of $\{0,1\}^*$ we will 
consider $\{0,1\}^n$ for some fixed but arbitrarily large $n$. 

Definition 1: A data sample $D$ is a subset of $\{0,1\}^n$. For 
technical convenience we want a model $M$ for $D$ to contain 
information about the cardinality of $D$. A model $M$ has the 
form $M = M' \cup \{\#i\}$, where $M' \subseteq \{0,1\}^n$ and $i \in \{0,1\}^n$. 
We can think of $i$ as the $i$th binary string in $\{0,1\}^n$. Denote 
the cardinalities by lower case letters: 

$$d = |D|, \quad m = |M'|.$$

If $D$ is a data sample and $M$ is a model for $D$ then $D \subseteq M' \subseteq M$, 
$M = M' \cup \{\#d\}$, and we write $M \sqsupset D$ or $D \sqsubset M$. 

Denote the complexity of a finite set $S$ by $K(S)$, the 
length (number of bits) of the shortest binary program $p$ from 
which the reference universal prefix machine $U$ computes a 
lexicographic listing of the elements of $S$ and then halts. That 
is, if $S = \{x_1, \ldots, x_d\}$, the elements given in lexicographic 
order, then $U(p) = \langle x_1, \langle x_2, \ldots, \langle x_{d-1}, x_d \rangle \ldots \rangle\rangle$. The shortest 
program $p$, or, if there is more than one such shortest 
program, then the first one that halts in a standard dovetailed 
running of all programs, is denoted by $S^*$. 

The conditional complexity $K(D \mid M)$ of $D \sqsubset M$ is the 
length (number of bits) of the shortest binary program $p$ from 
which the reference universal prefix machine $U$ from input $M$ 
(given as a list of elements) outputs $D$ as a lexicographically 
ordered list of elements and halts. We have 

$$K(D \mid M) \le \log \binom{m}{d} + O(1). \qquad (1)$$

The upper bound follows by considering a self-delimiting 
code of $D$ given $M$ (including the number $d$ of elements 
in $D$), consisting of a $\lceil \log \binom{m}{d} \rceil$ bit long index of $D$ in the 
lexicographic ordering of the number of ways to choose $d$ 
elements from $M' = M - \{\#d\}$. This code is called the data-to-model 
code. Its length quantifies the maximal "typicality," 
or "randomness," any data sample $D$ of $d$ elements can have 
with respect to model $M$ with $M \sqsupset D$. 
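To make the data-to-model code concrete, the sketch below computes the index of $D$ among the $d$-element subsets of $M'$ in lexicographic order, together with the index length $\lceil \log \binom{m}{d} \rceil$. It is a brute-force illustration for toy sizes only; the function name `data_to_model_code` and the example sets are our own assumptions, not part of the paper.

```python
from itertools import combinations
from math import ceil, comb, log2


def data_to_model_code(D, M_prime):
    """Return (index, bits): the rank of D among the d-subsets of M' and the index length."""
    elems = sorted(M_prime)
    target = tuple(sorted(D))
    if not set(target) <= set(elems):
        raise ValueError("D must be a subset of M'")
    total = comb(len(elems), len(target))  # number of ways to choose d out of m
    bits = ceil(log2(total))               # length of the index in bits
    # combinations() over a sorted list enumerates subsets in lexicographic order.
    for index, subset in enumerate(combinations(elems, len(target))):
        if subset == target:
            return index, bits
    raise AssertionError("unreachable: D is a d-subset of M'")


# Toy example (assumed): M' is all 3-bit strings, D has two elements.
M_prime = {format(i, "03b") for i in range(8)}
index, bits = data_to_model_code({"010", "111"}, M_prime)
```

Here $m = 8$ and $d = 2$, so there are $\binom{8}{2} = 28$ candidate subsets and the index fits in 5 bits.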

Definition 2: The lack of typicality of $D$ with respect 
to $M$ is measured by the amount by which $K(D \mid M)$ falls 
short of the length of the data-to-model code. The randomness 
deficiency of $D \sqsubset M$ is defined by 

$$\delta(D \mid M) = \log \binom{m}{d} - K(D \mid M), \qquad (2)$$

for $D \sqsubset M$, and $\infty$ otherwise. 

The randomness deficiency can be a little smaller than 0, 
but by no more than a constant. If the randomness deficiency 
is not much greater than 0, then there are no simple special 
properties that single $D$ out from the majority of data samples 
of cardinality $d$ to be drawn from $M' = M - \{\#d\}$. This 
is not just terminology: If $\delta(D \mid M)$ is small enough, then 
$D$ satisfies all properties of low Kolmogorov complexity that 
hold for the majority of subsets of cardinality $d$ of $M'$. To be 
precise: A property $P$ represented by $M$ is a subset of $M'$, 
and we say that $D$ satisfies property $P$ if $D$ is a subset of $P$. 

Lemma 1: Let $d, m, n$ be natural numbers, and let $D \subseteq M' \subseteq \{0,1\}^n$, 
$M = M' \cup \{\#d\}$, $|D| = d$, $|M'| = m$, and 
let $\delta$ be a simple function of the natural numbers to the real 
numbers, that is, $K(\delta)$ is a constant, for example, $\delta$ is $\log$ or $\sqrt{\cdot}$. 

(i) If $P$ is a property satisfied by all $D \sqsubset M$ with 
$\delta(D \mid M) \le \delta(n)$, then $P$ holds for a fraction of at least $1 - 1/2^{\delta(n)}$ 
of the subsets of $M' = M - \{\#d\}$. 

(ii) Let $P$ be a property that holds for a fraction of at least 
$1 - 1/2^{\delta(n)}$ of the subsets of $M' = M - \{\#d\}$. There is a 
constant $c$, such that $P$ holds for every $D \sqsubset M$ with 
$\delta(D \mid M) \le \delta(n) - K(P \mid M) - c$. 

Proof: (i) By assumption, all data samples $D \sqsubset M$ with 

$$K(D \mid M) \ge \log \binom{m}{d} - \delta(n) \qquad (3)$$

satisfy $P$. There are only 

$$\sum_{i=0}^{\log \binom{m}{d} - \delta(n) - 1} 2^i = \binom{m}{d} 2^{-\delta(n)} - 1$$

programs of length smaller than $\log \binom{m}{d} - \delta(n)$, so there are 
at most that many $D \sqsubset M$ that do not satisfy (3). There are 
$\binom{m}{d}$ sets $D$ that satisfy $D \sqsubset M$, and hence a fraction of at 
least $1 - 1/2^{\delta(n)}$ of them satisfy (3). 

(ii) Suppose $P$ does not hold for a data sample $D \sqsubset M$ 
and the randomness deficiency (2) satisfies $\delta(D \mid M) \le \delta(n) - K(P \mid M) - c$. 
Then we can reconstruct $D$ from a description 
of $M$, and $D$'s index $j$ in an effective enumeration of all 
subsets of $M$ of cardinality $d$ for which $P$ doesn't hold. There 
are at most $\binom{m}{d} / 2^{\delta(n)}$ such data samples by assumption, and 
therefore there are constants $c_1, c_2$ such that 

$$K(D \mid M) \le \log j + c_1 \le \log \binom{m}{d} - \delta(n) + c_2.$$

Hence, by the assumption on the randomness deficiency of $D$, 
we find $K(P \mid M) \le c_2 - c$, which contradicts the necessary 
nonnegativity of $K(P \mid M)$ if we choose $c > c_2$. ■ 
The minimal randomness deficiency function of the data 
sample $D$ is defined by 

$$\beta_D(\alpha) = \min \{\delta(D \mid M) : M \sqsupset D, \ K(M) \le \alpha\}, \qquad (4)$$

where we set $\min \emptyset = \infty$. The smaller $\delta(D \mid M)$ is, the more 
$D$ can be considered as a typical data sample from $M$. This 
means that a set $M$ for which $D$ incurs minimal randomness 
deficiency, in the model class of contemplated sets of given 
maximal Kolmogorov complexity, is a "best fitting" model for 
$D$ in that model class, a most likely explanation, and $\beta_D(\alpha)$ 
can be viewed as a constrained best fit estimator. 

A. Minimum Description Length Estimator 

The length of the minimal two-part code for $D$ with model 
$M \sqsupset D$ consists of the model cost $K(M)$ plus the length of 
the index of $D$ in the enumeration of choices of $d$ elements 
out of $m$ ($m = |M'|$ and $M' = M - \{\#d\}$). Consider the 
model class of $M$'s of given maximal Kolmogorov complexity 
$\alpha$. The MDL function or constrained MDL estimator is 

$$\lambda_D(\alpha) = \min_{M} \{\Lambda(M) : M \sqsupset D, \ K(M) \le \alpha\}, \qquad (5)$$

where $\Lambda(M) = K(M) + \log \binom{m}{d} \ge K(D) + O(1)$ is the total 
length of the two-part code of $D$ with help of the model $M$. This 
function $\lambda_D(\alpha)$ is the celebrated optimal two-part MDL code 
length as a function of $\alpha$, with the model class restricted to 
models of code length at most $\alpha$. The functions $\beta_D$ and $\lambda_D$ 
are examples of Kolmogorov's structure functions, [6], [15]. 

Indeed, consider the following two-part code for $D \sqsubset M$: 
the first part is a shortest self-delimiting program $p$ for $M$ 
and the second part is the $\lceil \log \binom{m}{d} \rceil$ bit long index of $D$ in the 
lexicographic ordering of all choices of $d$ elements from $M$. 
Since $M$ determines $\log \binom{m}{d}$ this code is self-delimiting and 
we obtain the two-part code, where the constant $O(1)$ is the 
length of an additional program that reconstructs $D$ from its 
two-part code. Trivially, $\lambda_D(\alpha) \ge K(D) + O(1)$. For those 
$\alpha$'s that have $\lambda_D(\alpha) = K(D) + O(1)$, the associated model 
$M \sqsupset D$ in at most $\alpha$ bits (witness for $\lambda_D(\alpha)$) is called a 
sufficient statistic for $D$. 
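The trade-off behind the constrained MDL estimator (5) can be illustrated with a computable stand-in. Since $K(M)$ is not computable, the sketch below replaces it by the length of an explicit, hypothetical self-delimiting code for a very restricted model class: all $n$-bit strings sharing a given prefix. The class, the code `model_bits`, and the function names are assumptions made for illustration, so the result is an upper bound in the spirit of $\lambda_D(\alpha)$, not its value.

```python
from math import comb, inf, log2
from os.path import commonprefix


def model_bits(prefix: str) -> int:
    # Assumed self-delimiting code: each prefix bit b is sent as '1b', then a closing '0'.
    return 2 * len(prefix) + 1


def two_part_length(D, prefix, n):
    """Model cost plus data-to-model cost log C(m, d) for the prefix model."""
    m = 2 ** (n - len(prefix))  # |M'|: all n-bit strings that start with `prefix`
    return model_bits(prefix) + log2(comb(m, len(D)))


def best_prefix_model(D, n, alpha):
    """Minimize the two-part length over prefix models with model cost at most alpha."""
    common = commonprefix(sorted(D))  # only prefixes of this keep D inside the model
    best = (None, inf)                # min over the empty set is infinity
    for k in range(len(common) + 1):
        w = common[:k]
        if model_bits(w) > alpha:
            break
        length = two_part_length(D, w, n)
        if length < best[1]:
            best = (w, length)
    return best


D = {"10100000", "10100111", "10101010"}
loose = best_prefix_model(D, n=8, alpha=5)
tight = best_prefix_model(D, n=8, alpha=9)
```

In this toy class each extra prefix bit costs two model bits but saves roughly $d = 3$ data bits, so raising the budget $\alpha$ from 5 to 9 moves the selected model from prefix `10` to prefix `1010`.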

Lemma 2: If $M$ is a sufficient statistic for $D$, then the 
randomness deficiency of $D$ in $M$ is $O(1)$, that is, $D$ is a 
typical data sample for $M$, and $M$ is a model of best fit for 
$D$. 

Proof: If $M$ is a sufficient statistic for $D$, then 
$K(M) + \log \binom{m}{d} = K(D) + O(1)$. The left-hand side of the latter 
equation is a two-part description of $D$ using the model 
$M \sqsupset D$ and as data-to-model code the index of $D$ in the 
enumeration of the number of choices of $d$ elements from 
$M$ in $\log \binom{m}{d}$ bits. This left-hand side equals the right-hand 
side which is the shortest one-part code of $D$ in $K(D)$ bits. 
Therefore, 

$$\begin{aligned} K(D) &\le K(D, M) + O(1) \\ &\le K(M) + K(D \mid M) + O(1) \\ &\le K(M) + \log \binom{m}{d} + O(1) = K(D) + O(1). \end{aligned}$$

The first and second inequalities are straightforward, the third 
inequality states that given $M \sqsupset D$ we can describe $D$ in a 
self-delimiting manner in $\log \binom{m}{d} + O(1)$ bits, and the final 
equality follows by the sufficiency property. This sequence of 
(in)equalities implies that $K(D \mid M) = \log \binom{m}{d} + O(1)$. ■ 

Remark 1 (Sufficient but not Typical): Note that 
the data sample $D$ can have randomness deficiency about 0, 
and hence be a typical element for models $M$, while $M$ is 
not a sufficient statistic. A sufficient statistic $M$ for $D$ has 
the additional property, apart from being a model of best 
fit, that $K(D, M) = K(D) + O(1)$ and therefore by the symmetry 
of information (see Appendix A) we have $K(M \mid D^*) = O(1)$: the sufficient 
statistic $M$ is a model of best fit that is almost completely 
determined by $D^*$, a shortest program for $D$. 

Remark 2 (Minimal Sufficient Statistic): The sufficient 
statistic associated with $\lambda_D(\alpha)$ with the least $\alpha$ is called 
the minimal sufficient statistic. 

Remark 3 (Probability Models): Reference [15] and 
this paper analyze a canonical setting where the models are 
finite sets. We can generalize the treatment to the case where 
the models are the computable probability mass functions. 
The computability requirement does not seem very restrictive. 
We cover most, if not all, probability mass functions 
ever considered, provided they have computable parameters. 
In the case of multiple data we consider probability mass 
functions $P$ that map subsets $B \subseteq \{0,1\}^n$ into $[0,1]$ such 
that $\sum_{B \subseteq \{0,1\}^n} P(B) = 1$. For every $0 \le d \le 2^n$, we define 
$P_d(B) = P(B \mid |B| = d)$. For data $D$ with $|D| = d$ we obtain 
$\lambda_D(\alpha) = \min_{P_d} \{K(P_d) + \log 1/P_d(D) : P_d(D) > 0 \text{ and } P_d \text{ is a computable probability mass function with } K(P_d) \le \alpha\}$. 
The general model class of computable probability mass 
functions is equivalent to the finite set model class, up to an 
additive logarithmic $O(\log dn)$ term. This result for multiple 
data generalizes the corresponding result for singleton data in 
[13], [15]. Since the other results in [15] such as (6) and those 
in Appendix B, generalized to multiple data, hold only up to 
the same additive logarithmic term anyway, they carry over to 
the probability models. 
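One direction of the correspondence between the two model classes is easy to see in a toy case: a finite set model $M'$ induces the uniform distribution over its $d$-element subsets, and the code length $\log 1/P_d(D)$ then equals the data-to-model code length $\log \binom{m}{d}$. The Python sketch below checks this; the helper name `uniform_pd` and the example sets are assumptions for illustration only.

```python
from itertools import combinations
from math import comb, isclose, log2


def uniform_pd(M_prime, d):
    """P_d induced by a finite-set model: uniform over the d-element subsets of M'."""
    subsets = [frozenset(c) for c in combinations(sorted(M_prime), d)]
    p = 1 / len(subsets)
    return {s: p for s in subsets}


M_prime = {"00", "01", "10", "11"}
D = frozenset({"01", "10"})
P2 = uniform_pd(M_prime, 2)
code_len = log2(1 / P2[D])                # log 1/P_d(D)
set_code_len = log2(comb(len(M_prime), 2))  # log C(m, d)
```

The two lengths agree up to floating-point rounding, which is the finite-set special case of the expression $K(P_d) + \log 1/P_d(D)$ above.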

The generality of the results is at the same time a restriction. 
In classical statistics one is commonly interested in 
model classes that are partially poorer and partially richer than 
the ones we consider. For example, the class of Bernoulli 
processes, or $k$-state Markov chains, is poorer than the class 
of computable probability mass functions of moderate maximal 
Kolmogorov complexity $\alpha$, in that the latter class may 
contain functions that require far more complex computations 
than the rigid syntax of the classical classes allows. Indeed, 
the class of computable probability mass functions of even 
moderate complexity allows implementation of a function 
mimicking a universal Turing machine computation. On the 
other hand, even the simple Bernoulli process can be equipped 
with a noncomputable real bias in $(0, 1)$, and hence the 
generated probability mass function over $n$ trials is not a 
computable function. This incomparability of the algorithmic 
model classes studied here and the traditional statistical model 
classes means that the current results cannot be directly 
transplanted to the traditional setting. They should be regarded 
as pristine truths that hold in a platonic world that can be used 
as a guideline to develop analogues in model classes that are of 
more traditional concern, as in [11]. 

B. Essence of Model Selection 

The first parameter we are interested in is the simplicity 
$K(M)$ of the model $M$ explaining the data sample $D$ ($D \sqsubset M$). 
The second parameter is how typical the data is with 
respect to $M$, expressed by the randomness deficiency 
$\delta(D \mid M) = \log \binom{m}{d} - K(D \mid M)$. The third parameter is how short 
the two-part code $\Lambda(M) = K(M) + \log \binom{m}{d}$ of the data sample 
$D$ using theory $M$ with $D \sqsubset M$ is. The second part consists 
of the full-length index, ignoring saving in code length using 
possible nontypicality of $D$ in $M$ (such as being the first $d$ 
elements in the enumeration of $M' = M - \{\#d\}$). These 
parameters induce a partial order on the contemplated set of 
models. We write $M_1 \le M_2$, if $M_1$ scores equal or less than 
$M_2$ in all three parameters. If this is the case, then we may 
say that $M_1$ is at least as good as $M_2$ as an explanation for $D$ 
(although the converse need not necessarily hold, in the sense 
that it is possible that $M_1$ is at least as good a model for $D$ 
as $M_2$ without scoring better than $M_2$ in all three parameters 
simultaneously). 
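The componentwise comparison just described is an ordinary dominance relation on triples, and it is only a partial order: some pairs of models are incomparable. A minimal Python sketch follows, with made-up numeric scores standing in for the noncomputable quantities $K(M)$, $\delta(D \mid M)$ and $\Lambda(M)$; the names `Score` and `at_least_as_good` are ours.

```python
from typing import NamedTuple


class Score(NamedTuple):
    complexity: float  # stand-in for K(M)
    deficiency: float  # stand-in for delta(D | M)
    two_part: float    # stand-in for Lambda(M)


def at_least_as_good(a: Score, b: Score) -> bool:
    """True if model a scores equal or less than model b in all three parameters."""
    return all(x <= y for x, y in zip(a, b))


# Hypothetical scores for three models of the same data sample.
m1 = Score(complexity=5, deficiency=1, two_part=20)
m2 = Score(complexity=6, deficiency=1, two_part=22)
m3 = Score(complexity=9, deficiency=0, two_part=21)
```

With these values `m1` dominates `m2`, while `m1` and `m3` are incomparable: `m1` is simpler and has a shorter two-part code, but `m3` has the smaller deficiency.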

The algorithmic statistical properties of a data sample $D$ are 
fully represented by the set $A_D$ of all triples 

$$(K(M), \delta(D \mid M), \Lambda(M))$$

with $M \sqsupset D$, together with a componentwise order relation 
on the elements of those triples. The complete characterization 
of this set follows from the results in [15], provided we 
generalize the singleton case treated there to the multiple data 
case required here. 

In that reference it is shown that if we minimize the length 
of a two-part code for an individual data sample, the two- 
part code consisting of a model description and a data-to- 
model code over the class of all computable models of at 
most a given complexity, then the following is the case. 
With certainty and not only with high probability as in the 
classical case this process selects an individual model that 
in a rigorous sense is (almost) the best explanation for the 
individual data sample that occurs among the contemplated 
models. (In modern versions of MDL, [4], [2], [11], one selects 
the model that minimizes just the data-to-model code length 
(ignoring the model code length), or minimax and mixture 
MDLs. These are not treated here.) These results are exposed 
in the proof and analysis of the equality: 

$$\beta_D(\alpha) = \lambda_D(\alpha) - K(D), \qquad (6)$$

which holds within negligible additive $O(\log dn)$ terms, in 
argument and value. We give the precise statement in 
Appendix B. 
Remark 4: Every model (set) $M$ that witnesses the value 
$\lambda_D(\alpha)$, also witnesses the value $\beta_D(\alpha)$ (but not vice versa). 
The functions $\lambda_D$ and $\beta_D$ can assume all possible shapes 
over their full domain of definition (up to additive logarithmic 
precision in both argument and value). We summarize these 
matters in Appendix B. 

C. Computability 

How difficult is it to compute the functions $\lambda_D$, $\beta_D$, and 
the minimal sufficient statistic? To express the properties 
appropriately we require the notion of functions that are not 
computable, but can be approximated monotonically by a 
computable function. 

Definition 3: A function $f : \mathcal{N} \to \mathcal{R}$ is upper semicomputable 
if there is a Turing machine $T$ computing a total function 
$\phi$ such that $\phi(x, t+1) \le \phi(x, t)$ and $\lim_{t \to \infty} \phi(x, t) = f(x)$. 
This means that $f$ can be computably approximated 
from above. If $-f$ is upper semicomputable, then $f$ is lower 
semicomputable. A function is called semicomputable if it is 
either upper semicomputable or lower semicomputable. If $f$ is 
both upper semicomputable and lower semicomputable, then 
we call $f$ computable (or recursive if the domain is integer or 
rational). 
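The shape of an approximation from above can be imitated with off-the-shelf compressors. The sketch below yields a non-increasing sequence of compressed sizes, playing the role of $\phi(x, t)$ for increasing $t$. It is an analogy under strong assumptions: the handful of standard-library compressors stands in for the enumeration of all programs, the sequence stops after finitely many attempts, and it does not converge to the Kolmogorov complexity. The function name `upper_approximations` is ours.

```python
import bz2
import lzma
import zlib
from typing import Iterator


def upper_approximations(data: bytes) -> Iterator[int]:
    """Yield a non-increasing sequence of sizes (in bits) of encodings of `data`."""
    best = 8 * len(data)  # trivial starting point: store the data verbatim
    yield best
    # Each "stage" tries one more compressor setting and keeps the best size so far.
    attempts = [lambda b, lv=lv: zlib.compress(b, lv) for lv in (1, 6, 9)]
    attempts += [bz2.compress, lzma.compress]
    for compress in attempts:
        best = min(best, 8 * len(compress(data)))
        yield best


sizes = list(upper_approximations(b"ab" * 500))
```

As with any approximation from above, nothing in the sequence itself reveals how far the current value is from the limit.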

To put matters in perspective: even if a function is com- 
putable, the most feasible type identified above, this doesn't 
mean much in practice. Functions like $f(x)$ of which the 
computation terminates in computation time of $t(x) = x^x$ (say 
measured in flops), are among the easily computable ones. 
But for $x = 30$, even a computer performing an unrealistic 
Teraflop per second, requires $30^{30}/10^{12} > 10^{28}$ seconds. 
This is more than $3 \cdot 10^{20}$ years. It is out of the question 
to perform such computations. Thus, the fact that a function 
or problem solution is computable gives no insight in how 
feasible it is. But there are worse functions and problems 
possible: For example, the ones that are semicomputable but 
not computable. Or worse yet, functions that are not even 
semicomputable. 
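The arithmetic in the running-time example is easy to reproduce. The short Python check below assumes a year of 365.25 days; the variable names are ours.

```python
# Running time t(x) = x**x at x = 30 on a machine doing 10**12 operations per second.
operations = 30 ** 30                         # exact integer
seconds = operations / 10 ** 12               # time in seconds
years = seconds / (365.25 * 24 * 3600)        # assumed year length of 365.25 days

exceeds_seconds_bound = seconds > 10 ** 28    # bound quoted in the text
exceeds_years_bound = years > 3 * 10 ** 20    # bound quoted in the text
```

Both bounds quoted in the text hold with a wide margin: the running time is on the order of $10^{32}$ seconds.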

Semicomputability gives no knowledge of convergence 
guarantees: even though the limit value is monotonically 
approximated, at no stage in the process do we know how 
close we are to the limit value. In Section III the indirect 
method of Algorithm Optimal MDL shows that the function 
$\lambda_D$ (the MDL-estimator) can be monotonically approximated 
in the upper semicomputable sense. But in [15] it was shown 
for singleton data samples, and therefore a fortiori for multiple 
data samples $D$, that the fitness function $\beta_D$ (the direct method 
of Remark 6) cannot be monotonically approximated in that 
sense, nor in the lower semicomputable sense, in both cases 
not even up to any relevant precision. Let us formulate this a 
little more precisely: 

The functions $\lambda_D(\alpha)$, $\beta_D(\alpha)$ have a finite domain for a 
given $D$ and hence can be given as a table, so formally 
speaking they are computable. But this evades the issue: there 
is no algorithm that computes these functions for given $D$ and 
$\alpha$. Considering them as two-argument functions it was shown 
(and the claimed precision quantified): 

> The function $\lambda_D(\alpha)$ is upper semicomputable but not 
computable up to any reasonable precision. 

> There is no algorithm that given $D^*$ and $\alpha$ finds $\lambda_D(\alpha)$. 

> The function $\beta_D(\alpha)$ is neither upper nor lower semicomputable, 
not even to any reasonable precision. To put 
$\beta_D(\alpha)$'s computability properties in perspective, clearly 
we can compute it given an oracle for the halting problem. 

The halting problem is the problem whether an 
arbitrary Turing machine started on an initially all-0 
tape will eventually terminate or compute forever. 
This problem was shown to be undecidable by A.M. 
Turing in 1937, see for example [9]. An oracle for 
the halting problem will, when asked, tell whether 
a given Turing machine computation will or will 
not terminate. Such a device is assumed in order to 
determine theoretical degrees of (non)computability, 
and is deemed not to exist. 
But using such an oracle gives us power beyond effective 
(semi)computability and therefore brings us outside the 
concerns of this paper. 

> There is no algorithm that given $D$ and $K(D)$ finds a 
minimal sufficient statistic for $D$ up to any reasonable 
precision. 

D. Invariance under Recoding of Data 

In what sense are the functions invariant under recoding of 
the data? If the functions $\beta_D$ and $\lambda_D$ give us the stochastic 
properties of the data $D$, then we would not expect those 
properties to change under recoding of the data into another 
format. For convenience, let us look at a singleton example. 
Suppose we recode $D = \{x\}$ by a shortest program $x^*$ 
for it. Since $x^*$ is incompressible it is a typical element of 
the set of all strings of length $|x^*| = K(x)$, and hence 
$\lambda_{x^*}(\alpha)$ drops to the Kolmogorov complexity $K(x)$ already 
for some $\alpha \le K(K(x))$, so almost immediately (and it stays 
within logarithmic distance of that line henceforth). That is, 
$\lambda_{x^*}(\alpha) = K(x)$ up to logarithmic additive terms in argument 
and value, irrespective of the (possibly quite different) shape of 
$\lambda_x$. Since the Kolmogorov complexity function $K(x) = |x^*|$ 
is not recursive, [5], the recoding function $f(x) = x^*$ is also 
not recursive. Moreover, while $f$ is one-to-one and total it is 
not onto. But it is the partiality of the inverse function (not all 
strings are shortest programs) that causes the collapse of the 
structure function. If one restricts the finite sets containing $x^*$ 
to be subsets of $\{y^* : y \in \{0,1\}^n\}$, then the resulting function 
$\lambda_{x^*}$ is within a logarithmic strip around $\lambda_x$. The coding 
function $f$ is upper semicomputable and deterministic. (One 
can consider other codes, using more powerful computability 
assumptions or probabilistic codes, but that is outside the scope 
of this paper.) However, the structure function is invariant 
under "proper" recoding of the data. 

Theorem 1: Let $f$ be a recursive permutation of the set 
of finite binary strings in $\{0,1\}^n$ (one-to-one, total, and onto), 
and extend $f$ to subsets $D \subseteq \{0,1\}^n$. Then, $\lambda_{f(D)}$ is "close" 
to $\lambda_D$ in the sense that the graph of $\lambda_{f(D)}$ is situated within 
a strip of width $K(f) + O(1)$ around the graph of $\lambda_D$. 

Proof: Let $M \sqsupset D$ be a witness of $\lambda_D(\alpha)$. Then, 
$M_f = \{f(y) : y \in M\}$ satisfies $K(M_f) \le \alpha + K(f) + O(1)$ 
and $|M_f| = |M|$. Hence, $\lambda_{f(D)}(\alpha + K(f) + O(1)) \le \lambda_D(\alpha)$. 
Let $M^f \sqsupset f(D)$ be a witness of $\lambda_{f(D)}(\alpha)$. Then, 
$M^f_{f^{-1}} = \{f^{-1}(y) : y \in M^f\}$ satisfies $K(M^f_{f^{-1}}) \le \alpha + K(f) + O(1)$ 
and $|M^f_{f^{-1}}| = |M^f|$. Hence, $\lambda_D(\alpha + K(f) + O(1)) \le \lambda_{f(D)}(\alpha)$ 
(since $K(f^{-1}) = K(f) + O(1)$). ■ 
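The two facts the proof relies on, that a permutation maps a model containing $D$ to a model of the same cardinality containing $f(D)$, and that the inverse maps it back, can be checked on a toy instance. The sketch below uses bitwise complement as an example of a recursive permutation of $\{0,1\}^n$; this choice, the helper names, and the example sets are our assumptions. The complexity term $K(f)$ is not modelled.

```python
from math import comb


def complement(s: str) -> str:
    """A simple permutation of n-bit strings: flip every bit (it is its own inverse)."""
    return "".join("1" if b == "0" else "0" for b in s)


def recode(S):
    """Extend the permutation from strings to sets of strings."""
    return {complement(y) for y in S}


M_prime = {"000", "001", "010", "011"}
D = {"001", "010"}
M_f, D_f = recode(M_prime), recode(D)

# Cardinalities, and hence the data-to-model code length log C(m, d), are unchanged.
same_choices = comb(len(M_f), len(D_f)) == comb(len(M_prime), len(D))
```

Only the model part of the two-part code can change under recoding, by the cost of describing $f$, which is where the strip of width $K(f) + O(1)$ comes from.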

III. Approximating the MDL Code 

We are given $D \subseteq \{0,1\}^n$, the data to explain, and the model 
class consisting of all models $M$ for $D$ that have complexity 
$K(M)$ at most $\alpha$. This $\alpha$ is the maximum complexity of an 
explanation we allow. As usual, we denote $m = |M| - 1$ 
(possibly indexed like $m_t = |M_t| - 1$) and $d = |D|$. We 
search for programs $p$ of length at most $\alpha$ that print a finite 
set $M \sqsupset D$. Such pairs $(p, M)$ are possible explanations. 
The best explanation is defined to be the $(p, M)$ for which 
$\delta(D \mid M)$ is minimal, that is, $\delta(D \mid M) = \beta_D(\alpha)$. Since the 
function $\beta_D(\alpha)$ is not computable, there is no algorithm that 
halts with the best explanation. To overcome this problem we 
minimize the randomness deficiency by minimizing the MDL 
code length, justified by (6), and thus maximize the fitness of 
the model for this data sample. Since (6) holds only up to a 
small error we should more properly say "almost minimize 
the randomness deficiency" and "almost maximize the fitness 
of the model." 

Definition 4: An algorithm $A$ is an MDL algorithm if 
the following holds. Let $D$ be a data sample consisting of $d$ 
separated words of length $n$ in $dn + O(\log dn)$ bits. Given 
inputs $D$ and $\alpha$ ($0 \le \alpha \le dn + O(\log dn)$), algorithm 
$A$ written as $A(D, \alpha)$ produces a finite sequence of pairs 
$(p_1, M_1), (p_2, M_2), \ldots, (p_f, M_f)$, such that every $p_t$ is a 
binary program of length at most $\alpha$ that prints a finite set $M_t$ 
with $D \sqsubset M_t$ and $|p_t| + \log \binom{m_t}{d} < |p_{t-1}| + \log \binom{m_{t-1}}{d}$ for 
every $1 < t \le f$. 

Remark 5: It follows that $K(M_t) \le |p_t|$ for all $1 < t \le f$. 
Note that an MDL algorithm may consider only a proper 
subset of all binary programs of length at most $\alpha$. In particular, 
the final $|p_f| + \log \binom{m_f}{d}$ may be greater than the optimal MDL 
code of length $\min\{K(M) + \log \binom{m}{d} : M \sqsupset D, \ K(M) \le \alpha\}$. 
This happens when a program $p$ printing $M$ with $M \sqsupset D$ and 
$|p| = K(M) \le \alpha$ is not in the subset of binary programs 
considered by the algorithm, or the algorithm gets trapped in 
a suboptimal solution. 

The next theorem gives an MDL algorithm that always finds the optimal MDL code and, moreover, the model concerned is shown to be an approximately best fitting model for data $D$.

Theorem 2: There exists an MDL algorithm which given $D$ and $\alpha$ satisfies $\lim_{t \to \infty}(p_t, M_t) = (p, M)$, such that $\delta(D \mid M) \le \beta_D(\alpha - O(\log dn)) + O(\log dn)$.

Proof: We exhibit such an MDL algorithm: 

Algorithm Optimal MDL $(D, \alpha)$

Step 1. Let $D$ be the data sample. Run all binary programs $p_1, p_2, \ldots$ of length at most $\alpha$ in lexicographic length-increasing order in a dovetailed style. The computation proceeds by stages $1, 2, \ldots$, and in each stage $j$ the overall computation executes step $j - k$ of the particular subcomputation of $p_k$, for every $k$ such that $j - k > 0$.

Step 2. At every computation step $t$, consider all pairs $(p, M)$ such that program $p$ has printed the set $M \supseteq D$ by time $t$. We assume that there is a first elementary computation step $t_0$ such that there is such a pair. Let a best explanation $(p_t, M_t)$ at computation step $t \ge t_0$ be a pair that minimizes the sum $|p| + \log\binom{m}{d}$ among all the pairs $(p, M)$.

Step 3. We only change the best explanation $(p_{t-1}, M_{t-1})$ of computation step $t - 1$ to $(p_t, M_t)$ at computation step $t$, if $|p_t| + \log\binom{m_t}{d} < |p_{t-1}| + \log\binom{m_{t-1}}{d}$.

In this MDL algorithm the best explanation $(p_t, M_t)$ changes from time to time due to the appearance of a strictly better explanation. Since no pair $(p, M)$ can be elected as best explanation twice, and there are only finitely many pairs, from some moment onward the explanation $(p_t, M_t)$ which is declared best does not change anymore. Therefore the limit $(p, M)$ exists. The model $M$ is a witness set of $\lambda_D(\alpha)$. The theorem follows by the earlier relation between $\lambda_D$ and $\beta_D$ and Remark 4. ■
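
The dovetailing schedule and the update rule of the algorithm can be illustrated with a toy simulation. This is only a sketch under strong assumptions: real binary programs on a universal machine are replaced by Python generators that yield `None` once per elementary step and finally yield the set they "print", and all names (`optimal_mdl`, `prints_after`, `loops_forever`) are hypothetical.

```python
from math import comb, log2

def prints_after(steps, model):
    """Factory for a toy program that runs `steps` steps, then prints `model`."""
    def make():
        for _ in range(steps):
            yield None
        yield frozenset(model)
    return make

def loops_forever():
    """Toy program that never halts."""
    while True:
        yield None

def optimal_mdl(programs, data, max_stages):
    """programs: list of (length_in_bits, generator_factory), in enumeration order.
    Returns the successive best explanations as (code_length, k, model)."""
    d = len(data)
    running = [(length, make()) for length, make in programs]
    finished = [False] * len(running)
    best, history = None, []
    for stage in range(1, max_stages + 1):
        # Stage j advances program p_k by one step whenever j - k > 0.
        for k, (length, gen) in enumerate(running, start=1):
            if stage - k <= 0 or finished[k - 1]:
                continue
            out = next(gen, "halted")
            if out is None:
                continue                      # still computing
            finished[k - 1] = True
            if out == "halted" or not data <= out:
                continue                      # no output, or M does not contain D
            cost = length + log2(comb(len(out), d))
            if best is None or cost < best[0]:  # Step 3: strict improvement only
                best = (cost, k, out)
                history.append(best)
    return history
```

Because of the non-halting program in such a simulation, an observer who sees only the current best explanation cannot tell whether a later stage will still improve it, which is the practical face of the non-computability discussed in the text.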
Thus, if we continue to approximate the two-part MDL code contemplating every relevant model, then we will eventually reach the optimal two-part code whose associated model is approximately the best explanation. That is the good news. The bad news is that we do not know when we have reached this optimal solution. The functions $h_D$ and $\lambda_D$, and their witness sets, cannot be computed within any reasonable accuracy (Section III-C). Hence, there does not exist a criterion we could use to terminate the approximation somewhere close to the optimum.

In the practice of real-world MDL, in the process of finding the optimal two-part MDL code, or indeed a suboptimal two-part MDL code, we often have to be satisfied with running times $t$ that are much less than the time to stabilization of the best explanation. For such small $t$, the model $M_t$ has a weak guarantee of goodness, since we know that

$$\delta(D \mid M_t) + K(D) \le |p_t| + \log\binom{m_t}{d},$$

because $K(D) \le K(D, M_t) \le K(M_t) + K(D \mid M_t)$ and therefore $K(D) - K(D \mid M_t) \le K(M_t) \le |p_t|$ (ignoring additive constants). That is, the randomness deficiency of $D$ in $M_t$ plus $K(D)$ is at most the known value $|p_t| + \log\binom{m_t}{d}$. Theorem 2 implies that Algorithm Optimal MDL gives not only some guarantee of goodness during the approximation process (see Section III-C), but also that, in the limit, that guarantee approaches the value of its lower bound, that is, $\delta(D \mid M) + K(D)$. Thus, in the limit, Algorithm Optimal MDL will yield an explanation that is only a little worse than the best explanation.

Remark 6: (Direct Method) Use the same dovetailing process as in Algorithm Optimal MDL, with the following addition. At every elementary computation step $t$, select a $(p, M)$ for which $\log\binom{m}{d} - K^t(D \mid M)$ is minimal among all programs $p$ that up to this time have printed a set $M \supseteq D$. Here $K^t(D \mid M)$ is the approximation of $K(D \mid M)$ from above defined by $K^t(D \mid M) = \min\{|q| :$ the reference universal prefix machine $U$ outputs $D$ on input $(q, M)$ in at most $t$ steps$\}$. Hence, $\log\binom{m}{d} - K^t(D \mid M)$ is an approximation from below to $\delta(D \mid M)$. Let $(q_t, M_t)$ denote the best explanation after $t$ steps. We only change the best explanation at computation step $t$, if $\log\binom{m_t}{d} - K^t(D \mid M_t) < \log\binom{m_{t-1}}{d} - K^t(D \mid M_{t-1})$. This time the same explanation can be chosen as the best one twice. However, from some time $t$ onward, the best explanation $(q_t, M_t)$ does not change anymore. In the approximation process, the model $M_t$ has no guarantee of goodness at all: since $\beta_D(\alpha)$ is not semicomputable up to any significant precision (Section III-C), we cannot know a significant upper bound either for $\delta(D \mid M_t)$ or for $\delta(D \mid M_t) + K(D)$. Hence, we must prefer the indirect method of Algorithm Optimal MDL, approximating a witness set for $\lambda_D(\alpha)$, instead of the direct one of approximating a witness set for $\beta_D(\alpha)$.



IV. Does Shorter MDL Code Imply Better Model? 

In practice we often must terminate an MDL algorithm as in Definition 4 prematurely. A natural assumption is that the longer we approximate the optimal two-part MDL code the better the resulting model explains the data. Thus, it is tempting to simply assume that in the approximation every next shorter two-part MDL code also yields a better model. However, this is not true. To give an example that shows where things go wrong it is easiest to first give the conditions under which premature search termination is all right. Suppose we replace the currently best explanation $(p_1, M_1)$ in an MDL algorithm with explanation $(p_2, M_2)$ only if $|p_2| + \log\binom{m_2}{d}$ is not just less than $|p_1| + \log\binom{m_1}{d}$, but less by more than the excess of $|p_1|$ over $K(M_1)$. Then, it turns out that every time we change the explanation we improve its goodness.

Theorem 3: Let $D$ be a data sample with $|D| = d$ ($0 < d < 2^n$). Let $(p_1, M_1)$ and $(p_2, M_2)$ be sequential (not necessarily consecutive) candidate best explanations, produced by an MDL algorithm $A(D, \alpha)$. If

$$|p_2| + \log\binom{m_2}{d} \le |p_1| + \log\binom{m_1}{d} - (|p_1| - K(M_1)) - 10 \log\log\binom{2^n}{d},$$

then $\delta(D \mid M_2) \le \delta(D \mid M_1) - 5 \log\log\binom{2^n}{d}$.
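
To make the hypothesis of Theorem 3 concrete, it can be written as a predicate. This is an illustrative sketch only: $K(M_1)$ is noncomputable, so the argument `k_m1` below is a value the caller must assume or bound, and the function names are hypothetical.

```python
from math import comb, log2

def log_binom(a: int, b: int) -> float:
    """log2 of the binomial coefficient C(a, b)."""
    return log2(comb(a, b))

def theorem3_condition(p1_len, m1, p2_len, m2, k_m1, n, d) -> bool:
    """True if the sufficient condition of Theorem 3 holds.

    k_m1 stands in for K(M_1); it cannot be computed and is supplied by the caller.
    """
    slack = 10 * log2(log_binom(2 ** n, d))
    lhs = p2_len + log_binom(m2, d)
    rhs = p1_len + log_binom(m1, d) - (p1_len - k_m1) - slack
    return lhs <= rhs
```

With $n = 20$, $d = 8$, $|p_1| = 100$, $m_1 = 2^{16}$, $|p_2| = 60$ and $m_2 = 2^{10}$, the predicate holds when $p_1$ is a shortest program (`k_m1 = 100`) but fails when $K(M_1) = 50$, even though the two-part code length drops in both cases.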

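Proof: For every pair of sets $M_1, M_2 \supseteq D$ we have

$$\delta(D \mid M_2) - \delta(D \mid M_1) = \Lambda + \Delta,$$

with $\Lambda = \Lambda(M_2) - \Lambda(M_1)$ and

$$\begin{aligned} \Delta &= -K(M_2) - K(D \mid M_2) + K(M_1) + K(D \mid M_1) \\ &\le -K(M_2, D) + K(M_1, D) + K(M_1^* \mid M_1) + O(1) \\ &\le K(M_1, D \mid M_2, D) + K(M_1^* \mid M_1) + O(1). \end{aligned}$$

The first inequality uses the trivial $-K(M_2, D) \ge -K(M_2) - K(D \mid M_2)$ and the nontrivial $K(M_1, D) + K(M_1^* \mid M_1) \ge K(M_1) + K(D \mid M_1)$, which follows by the symmetry of information, and the second inequality uses the general property that $K(a \mid b) \ge K(a) - K(b)$. By the assumption in the theorem,

$$\begin{aligned} \Lambda &\le |p_2| + \log\binom{m_2}{d} - \Lambda(M_1) \\ &= |p_2| + \log\binom{m_2}{d} - \left(|p_1| + \log\binom{m_1}{d}\right) + (|p_1| - K(M_1)) \\ &\le -10 \log\log\binom{2^n}{d}. \end{aligned}$$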



Since by assumption the difference in MDL codes satisfies $\Lambda = \Lambda(M_2) - \Lambda(M_1) \le -10 \log\log\binom{2^n}{d}$, it suffices to show that $K(M_1, D \mid M_2, D) + K(M_1^* \mid M_1) \le 5 \log\log\binom{2^n}{d}$ to prove the theorem. Note that $(p_1, M_1)$ and $(p_2, M_2)$ are in this order sequential candidate best explanations in the algorithm, and every candidate best explanation may appear only once. Hence, to identify $(p_1, M_1)$ we only need to know the MDL algorithm $A$, the maximal complexity $\alpha$ of the contemplated models, the data sample $D$, the candidate explanation $(p_2, M_2)$, and the number $j$ of candidate best explanations in between $(p_1, M_1)$ and $(p_2, M_2)$. To identify $M_1^*$ from $M_1$ we only require $K(M_1)$ bits. The program $p_2$ can be found from $M_2$ and the length $|p_2| \le \alpha$, as the first program computing $M_2$ of length $|p_2|$ in the process of running the algorithm $A(D, \alpha)$. Since $A$ is an MDL algorithm we have $j \le |p_1| + \log\binom{m_1}{d} \le \alpha + \log\binom{2^n}{d}$, and $K(M_1) \le \alpha$. Therefore,

$$\begin{aligned} K(M_1, D \mid M_2, D) + K(M_1^* \mid M_1) &\le \log|p_2| + \log\alpha + \log K(M_1) + \log j + b \\ &\le 3 \log\alpha + \log\log\binom{2^n}{d} + b, \end{aligned}$$

where $b$ is the number of bits we need to encode the description of the MDL algorithm, the descriptions of the constituent parts self-delimitingly, and the description of a program to reconstruct $M_1^*$ from $M_1$. Since $\alpha \le n + O(\log n)$, we find

$$\begin{aligned} K(M_1, D \mid M_2, D) + K(M_1^* \mid M_1) &\le 3 \log n + \log\log\binom{2^n}{d} + O\!\left(\log\log\log\binom{2^n}{d}\right) \\ &\le 5 \log\log\binom{2^n}{d}, \end{aligned}$$

where the last inequality follows from $0 < d < 2^n$ and $d$ being an integer. ■

Remark 7: We need an MDL algorithm in order to restrict the sequence of possible candidate models examined to at most $\alpha + \log\binom{2^n}{d}$ with $\alpha \le nd + O(\log nd)$ rather than all of the $2^{2^n - d}$ possible models $M$ satisfying $M \supseteq D$.

Remark 8: In the sequence $(p_1, M_1), (p_2, M_2), \ldots$ of candidate best explanations produced by an MDL algorithm, $(p_{t'}, M_{t'})$ is actually better than $(p_t, M_t)$ ($t < t'$), if the improvement in the two-part MDL code length is the given logarithmic term in excess of the unknown, and in general noncomputable, $|p_t| - K(M_t)$. On the one hand, if $|p_t| = K(M_t) + O(1)$, and

$$|p_{t'}| + \log\binom{m_{t'}}{d} \le |p_t| + \log\binom{m_t}{d} - 10 \log\log\binom{2^n}{d},$$

then $M_{t'}$ is a better explanation for data sample $D$ than $M_t$, in the sense that

$$\delta(D \mid M_{t'}) \le \delta(D \mid M_t) - 5 \log\log\binom{2^n}{d}.$$

On the other hand, if $|p_t| - K(M_t)$ is large, then $M_{t'}$ may be a much worse explanation than $M_t$. Then, it is possible that we improve the two-part MDL code length by giving a worse model $M_{t'}$ using, however, a $p_{t'}$ such that $|p_{t'}| + \log\binom{m_{t'}}{d} < |p_t| + \log\binom{m_t}{d}$ while $\delta(D \mid M_{t'}) > \delta(D \mid M_t)$.

V. Shorter MDL Code May Not Be Better 

Assume that we want to infer a language, given a single positive example (element of the language). The positive example is $D = \{x\}$ with $x = x_1 x_2 \ldots x_n$ and $x_i \in \{0,1\}$ for $1 \le i \le n$. We restrict the question to inferring a language consisting of a set of elements of the same length as the positive example, that is, we infer a subset of $\{0,1\}^n$. We can view this as inferring the slice $L_n$ of the (possibly infinite) target language $L$ consisting of all words of length $n$ in the target language. We identify the singleton data sample $D$ with its constituent data string $x$. For the models we always have $M = M' \cup \{\#1\}$ with $M' \subseteq \{0,1\}^n$. For simplicity we delete the cardinality indicator $\{\#1\}$ since it is always 1 and write $M = M' \subseteq \{0,1\}^n$.

Every $M \subseteq \{0,1\}^n$ can be represented by its characteristic sequence $\chi = \chi_1 \ldots \chi_{2^n}$ with $\chi_i = 1$ if the $i$th element of $\{0,1\}^n$ is in $M$, and $0$ otherwise. Conversely, every string of $2^n$ bits is the characteristic sequence of a subset of $\{0,1\}^n$. Most of these subsets are "random" in the sense that they cannot be represented concisely: their characteristic sequence is incompressible. Now choose some integer $\delta$. Simple counting tells us that there are only $2^{2^n - \delta} - 1$ binary strings of length $< 2^n - \delta$. Thus, the number of possible binary programs of length $< 2^n - \delta$ is at most $2^{2^n - \delta} - 1$. This in turn implies (since every program describes at best one such set) that the number of subsets $M \subseteq \{0,1\}^n$ with $K(M \mid n) < 2^n - \delta$ is at most $2^{2^n - \delta} - 1$. Therefore, the number of subsets $M \subseteq \{0,1\}^n$ with

$$K(M \mid n) \ge 2^n - \delta$$

is greater than

$$(1 - 1/2^{\delta})\, 2^{2^n}.$$
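
The counting argument can be checked numerically for small parameters. The sketch below is only an illustration with hypothetical function names; it counts candidate descriptions exactly and uses rational arithmetic to avoid rounding.

```python
from fractions import Fraction

def strings_shorter_than(length: int) -> int:
    """Number of binary strings of length < `length` (the empty string included)."""
    return sum(2 ** k for k in range(length))

def incompressible_fraction_lower_bound(n: int, delta: int) -> Fraction:
    """Lower bound on the fraction of subsets M of {0,1}^n with K(M|n) >= 2^n - delta.

    Every subset with smaller complexity needs its own program of length
    < 2^n - delta, so there are at most strings_shorter_than(2^n - delta) of them.
    """
    total_subsets = 2 ** (2 ** n)
    compressible_at_most = strings_shorter_than(2 ** n - delta)
    return Fraction(total_subsets - compressible_at_most, total_subsets)
```

For $n = 3$ and $\delta = 4$ this gives $241/256$, which indeed exceeds $1 - 1/2^4 = 240/256$.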

Now if $K(M)$ is significantly greater than $K(x)$, then it is impossible to learn $M$ from $x$. This follows already from the fact that $K(M \mid x) \ge K(M \mid x^*) + O(1) = K(M) - K(x) + K(x \mid M^*) + O(1)$ by the symmetry of information (note that $K(x \mid M^*) \ge 0$). That is, we need more than $K(M) - K(x)$ extra bits of dedicated information to deduce $M$ from $x$. Almost all sets in $\{0,1\}^n$ have such high complexity that no effective procedure can infer them from a single example. This holds in particular for every (even moderately) random set.

Thus, to infer such a subset $M \subseteq \{0,1\}^n$, given a sample datum $x \in M$, using the MDL principle is clearly out of the question. The datum $x$ can be literally described in $n$ bits by the trivial MDL code $M = \{x\}$ with $x$ literal at self-delimiting model cost at most $n + O(\log n)$ bits and data-to-model cost $\log|M| = 0$. It can be concluded that the only sets $M$ that can possibly be inferred from $x$ (using MDL or any other effective deterministic procedure) are those that have $K(M) \le K(x) \le n + O(\log n)$. Such sets are extremely rare: only an at most

$$2^{-2^n + n + O(\log n)}$$

fraction of all subsets of $\{0,1\}^n$ has that small prefix complexity. This negligible fraction of possibly learnable sets shows that such sets are very nonrandom; they are simple in the sense that their characteristic sequences have great regularity (otherwise the Kolmogorov complexity could not be this small). But this is all right: we do not want to learn random, meaningless, languages, but only languages that have meaning. "Meaning" is necessarily expressed in terms of regularity.

Even if we can learn the target model by an MDL algorithm 
in the limit, by selecting a sequence of models that decrease 
the MDL code with each next model, it can still be the case 
that a later model in this sequence is a worse model than a 
preceding one. Theorem 3 showed conditions that prevent this
from happening. We now show that if those conditions are not 
satisfied, it can indeed happen. 

Theorem 4: There is a datum $x$ ($|x| = n$) with explanations $(p_t, M_t)$ and $(p_{t'}, M_{t'})$ such that $|p_{t'}| + \log m_{t'} \le |p_t| + \log m_t - 10 \log n$ but $\delta(x \mid M_{t'}) \gg \delta(x \mid M_t)$. That is, $M_{t'}$ is much worse fitting than $M_t$. There is an MDL algorithm $A(x, n)$ generating $(p_t, M_t)$ and $(p_{t'}, M_{t'})$ as best explanations with $t' > t$.

Remark 9: Note that the condition of Theorem 3 is different from the first inequality in Theorem 4 since the former required an extra $-|p_t| + K(M_t)$ term in the right-hand side.



Proof: Fix datum $x$ of length $n$ which can be divided in $uvw$ with $u, v, w$ of equal length (say $n$ is a multiple of 3) with $K(x) = K(u) + K(v) + K(w) = \frac{2}{3}n$, $K(u) = \frac{1}{9}n$, $K(v) = \frac{4}{9}n$, and $K(w) = \frac{1}{9}n$ (with the last four equalities holding up to additive $O(\log n)$ terms). Additionally, take $n$ sufficiently large so that $0.1n \gg 10 \log n$.

Define $x^i = x_1 x_2 \ldots x_i$ and an MDL algorithm $A(x, n)$ that examines the sequence of models $M_i = \{x^i\}\{0,1\}^{n-i}$, with $i = 0, \frac{1}{3}n, \frac{2}{3}n, n$. The algorithm starts with candidate model $M_0$ and switches from the current candidate to candidate $M_i$, $i = \frac{1}{3}n, \frac{2}{3}n, n$, if that model gives a shorter MDL code than the current candidate.

Now $K(M_i) = K(x^i) + O(\log n)$ and $\log m_i = n - i$, so the MDL code length $\Lambda(M_i) = K(x^i) + n - i + O(\log n)$. Our MDL algorithm uses a compressor that does not compress $x^i$ all the way to length $K(x^i)$, but codes $x^i$ self-delimitingly at $0.9i$ bits, that is, it compresses $x^i$ by 10%. Thus, the MDL code length is $0.9i + \log m_i = 0.9i + n - i = n - 0.1i$ for every contemplated model $M_i$ ($i = 0, \frac{1}{3}n, \frac{2}{3}n, n$). The next equalities hold again up to $O(\log n)$ additive terms.

• The MDL code length of the initial candidate model $M_0$ is $n$. The randomness deficiency $\delta(x \mid M_0) = n - K(x \mid M_0) = \frac{1}{3}n$. The last equality holds since clearly $K(x \mid M_0) = K(x \mid n) = \frac{2}{3}n$.

• For the contemplated model $M_{n/3}$ we obtain the following. The MDL code length for model $M_{n/3}$ is $n - n/30$. The randomness deficiency $\delta(x \mid M_{n/3}) = \log m_{n/3} - K(x \mid M_{n/3}) = \frac{2}{3}n - K(v \mid n) - K(w \mid n) = \frac{1}{9}n$.

• For the contemplated model $M_{2n/3}$ we obtain the following. The MDL code length is $n - 2n/30$. The randomness deficiency is $\delta(x \mid M_{2n/3}) = \log m_{2n/3} - K(x \mid M_{2n/3}) = \frac{1}{3}n - K(w \mid n) = \frac{2}{9}n$.

Thus, our MDL algorithm initializes with candidate model $M_0$, then switches to candidate $M_{n/3}$ since this model decreases the MDL code length by $n/30$. Indeed, $M_{n/3}$ is a much better model than $M_0$, since it decreases the randomness deficiency by a whopping $\frac{2}{9}n$. Subsequently, however, the MDL process switches to candidate model $M_{2n/3}$ since it decreases the MDL code length again, by $n/30$. But $M_{2n/3}$ is a much worse model than the previous candidate $M_{n/3}$, since it increases the randomness deficiency again greatly, by $\frac{1}{9}n$. ■
Remark 10: By Theorem 3 we know that if in the process of MDL estimation by a sequence of significantly decreasing MDL codes a candidate model is represented by its shortest program, then the following candidate model which improves the MDL code is actually a model of at least as good fit as the preceding one. Thus, if in the example used in the proof above we encode the models at shortest code length, we obtain MDL code lengths $n$ for $M_0$, $K(u) + \frac{2}{3}n = \frac{7}{9}n$ for $M_{n/3}$, and $K(u) + K(v) + \frac{1}{3}n = \frac{8}{9}n$ for $M_{2n/3}$. Hence the MDL estimator using shortest model code length changes candidate model $M_0$ for $M_{n/3}$, improving the MDL code length by $\frac{2}{9}n$ and the randomness deficiency by $\frac{2}{9}n$. However, and correctly, it does not change candidate model $M_{n/3}$ for $M_{2n/3}$, since that would increase the MDL code length by $\frac{1}{9}n$. It thereby correctly avoids increasing the randomness deficiency by $\frac{1}{9}n$. Thus, by the cited theorem, the oscillating randomness deficiency in the MDL estimation process in the proof above can only arise in cases where the consecutive candidate models are not coded at minimum cost while the corresponding two-part MDL code lengths are decreasing.
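
The example from the proof of Theorem 4 and its variant in Remark 10 can be replayed with exact fractions. The sketch below is illustrative only: it assumes $K(u) = n/9$, $K(v) = 4n/9$, $K(w) = n/9$ (so $K(x) = 2n/3$), measures everything in units of $n$, drops all $O(\log n)$ terms, and uses hypothetical names throughout.

```python
from fractions import Fraction as F

# Assumed complexities of the three thirds of x, in units of n.
K_U, K_V, K_W = F(1, 9), F(4, 9), F(1, 9)
# K(x^i) for the prefixes of length i = 0, n/3, 2n/3, n.
PREFIX_K = {F(0): F(0), F(1, 3): K_U, F(2, 3): K_U + K_V, F(1): K_U + K_V + K_W}

def deficiency(i):
    """delta(x | M_i) = log m_i - K(x | M_i), where log m_i = n - i."""
    return (1 - i) - (PREFIX_K[F(1)] - PREFIX_K[i])

def code_length(i, model_bits):
    """Two-part length: model bits for x^i plus n - i bits for the index of x."""
    return model_bits(i) + (1 - i)

def run(model_bits):
    """Greedy MDL over i = n/3, 2n/3, n starting from M_0; returns (i, deficiency)."""
    current, trace = F(0), [(F(0), deficiency(F(0)))]
    for i in (F(1, 3), F(2, 3), F(1)):
        if code_length(i, model_bits) < code_length(current, model_bits):
            current = i
            trace.append((i, deficiency(i)))
    return trace

lossy = run(lambda i: F(9, 10) * i)     # the 10% compressor used in the proof
shortest = run(lambda i: PREFIX_K[i])   # models coded at K(x^i)
```

Under these assumptions the deficiencies along `lossy` are $1/3, 1/9, 2/9, 0$ (not monotone), while along `shortest` they are $1/3, 1/9, 0$, because the step to $M_{2n/3}$ is rejected.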

VI. Inferring a Grammar (DFA) From Positive 
Examples 

Assume that we want to infer a language, given a set of positive examples (elements of the language) $D$. For convenience we restrict the question to inferring a language $M = M' \cup \{\#d\}$ with $M' \subseteq \{0,1\}^n$. We can view this as inferring the slice $L_n$ (corresponding to $M'$) of the target language $L$ consisting of all words of length $n$ in the target language. Since $D$ consists of a subset of positive examples of $M'$ we have $D \subseteq M$. To infer a language $M$ from a set of positive examples $D \subseteq M$ is, of course, a much more natural situation than to infer a language from a singleton $x$ as in the previous section. Note that the complexity $K(x)$ of a singleton $x$ of length $n$ cannot exceed $n + O(\log n)$, while the complexity of a language of which $x$ is an element can rise to $2^n + O(\log n)$. In the multiple data sample setting $K(D)$ can rise to $2^n + O(\log n)$, just as $K(M)$ can. That is, the description of $n$ takes $O(\log n)$ bits and the description of the characteristic sequence of a subset of $\{0,1\}^n$ may take $2^n$ bits, everything self-delimitingly. So contrary to the singleton datum case, in principle models $M$ of every possible model complexity can be inferred depending on the data $D$ at hand. An obvious example is $D = M - \{\#d\}$. Note that the cardinality of $D$ plays a role here, since the complexity $K(D \mid n) \le \log\binom{2^n}{d} + O(\log d)$ with equality for certain $D$. A traditional and well-studied problem in this setting is to infer a grammar from a language example.

The field of grammar induction studies among other things 
a class of algorithms that aims at constructing a grammar by 
means of incremental compression of the data set represented 
by the digraph of a deterministic finite automaton (DFA) 
accepting the data set. This digraph can be seen as a model 
for the data set. Every word in the data set is represented as 
a path in the digraph with the symbols either on the edges 
or on the nodes. The learning process takes the form of a 
guided incremental compression of the data set by means of 
merging or clustering of the nodes in the graph. None of these 
algorithms explicitly makes an estimate of the data-to-model 
code. Instead they use heuristics to guide the model reduction. 
After a certain number of computational steps a proposal for 
a grammar can be constructed from the current state of the 
compressed graph. Examples of such algorithms are SP [17], 
[16], EMILE [1], ADIOS [14], and a number of DFA in- 
duction algorithms, such as "Evidence Driven State Merging" 
(EDSM), [7], [18]. Related compression-based theories and 
applications appear in [8], [3]. Our results (above and below) 
do not imply that compression algorithms improving the MDL 
code of DFAs can never work on real life data sets. There 
is considerable empirical evidence that there are situations in 
which they do work. In those cases specific properties of a 
restricted class of languages or data sets must be involved. 

Our results are applicable to the common digraph simpli- 
fication techniques used in grammar inference. The results 
hold equally for algorithms that use just positive examples, 
just negative examples, or both, using any technique (not just 
digraph simplification). 

Definition 5: A DFA is a tuple $A = (S, Q, q_0, t, F)$, where $S$ is a finite set of input symbols, $Q$ is a finite set of states, $t : Q \times S \to Q$ is the transition function, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of final states.

The DFA $A$ is started in the initial state $q_0$. If it is in state $q \in Q$ and receives input symbol $s \in S$ it changes its state to $q' = t(q, s)$. If the machine after zero or more input symbols, say $s_1, \ldots, s_n$, is driven to a state $q \in F$ then it is said to accept the word $w = s_1 \ldots s_n$, otherwise it rejects the word $w$. The language accepted by $A$ is $L(A) = \{w : w \text{ is accepted by } A\}$. We denote $L_n(A) = L(A) \cap \{0,1\}^n$.
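
Definition 5 translates directly into code. The following minimal sketch uses a hypothetical `DFA` class with a dictionary as transition function, and computes the slice $L_n(A)$ by brute-force enumeration, which is only feasible for small $n$.

```python
from itertools import product

class DFA:
    """A = (S, Q, q0, t, F) with the transition function stored as a dict."""

    def __init__(self, symbols, states, initial, transition, final):
        self.symbols, self.states = symbols, states
        self.initial, self.transition, self.final = initial, transition, final

    def accepts(self, word: str) -> bool:
        state = self.initial
        for symbol in word:
            state = self.transition[(state, symbol)]
        return state in self.final

    def slice(self, n: int) -> set:
        """L_n(A): the accepted words of length exactly n."""
        words = ("".join(w) for w in product(self.symbols, repeat=n))
        return {w for w in words if self.accepts(w)}

# Two-state parity automaton: accepts words with an even number of 1's.
parity = DFA(
    symbols="01",
    states={"even", "odd"},
    initial="even",
    transition={("even", "0"): "even", ("even", "1"): "odd",
                ("odd", "0"): "odd", ("odd", "1"): "even"},
    final={"even"},
)
```

For this parity automaton `parity.slice(4)` contains 8 of the 16 words of length 4, in line with the observation that half of all $n$-bit strings have an even number of 1's.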

We can effectively enumerate the DFAs as $A_1, A_2, \ldots$ in lexicographic length-increasing order. This enumeration we call the standard enumeration.

The first thing we need to do is to show that all laws that hold for finite-set models also hold for DFA models, so all theorems, lemmas, and remarks above, both positive and negative, apply. To do so, we show that for every data sample $D \subseteq \{0,1\}^n$ and a contemplated finite set model for it, there is an almost equivalent DFA.

Lemma 3: Let $d = |D|$, $M' = M - \{\#d\}$ and $m = |M'|$. For every $D \subseteq M' \subseteq \{0,1\}^n$ there is a DFA $A$ with $L_n(A) = M'$ such that $K(A, n) \le K(M') + O(1)$ (which implies $K(A, d, n) \le K(M) + O(1)$), and $\delta(D \mid M) \le \delta(D \mid A, d, n) + O(1)$.

Proof: Since $M'$ is a finite set of binary strings, there is a DFA that accepts it, by elementary formal language theory. Define DFA $A$ such that $A$ is the first DFA in the standard enumeration for which $L_n(A) = M'$. (Note that we can infer $n$ from both $M$ and $M'$.) Hence, $K(A, n) \le K(M') + O(1)$ and $K(A, d, n) \le K(M) + O(1)$. Trivially, $\log\binom{m}{d} = \log\binom{|L_n(A)|}{d}$ and $K(D \mid A, n) \le K(D \mid M') + O(1)$, since $A$ may have information about $D$ beyond $M'$. This implies $K(D \mid A, d, n) \le K(D \mid M) + O(1)$, so that $\delta(D \mid M) \le \delta(D \mid A, d, n) + O(1)$. ■

Lemma 4 is the converse of Lemma 3: for every data sample $D$ and a contemplated DFA model for it, there is a finite set model for $D$ that has no worse complexity, randomness deficiency, and worst-case data-to-model code for $D$, up to additive logarithmic precision.

Lemma 4: Use the terminology of Lemma 3. For every $D \subseteq L_n(A) \subseteq \{0,1\}^n$, there is a model $M \supseteq D$ such that $\log\binom{m}{d} = \log\binom{|L_n(A)|}{d}$, $K(M') \le K(A, n) + O(1)$ (which implies $K(M) \le K(A, d, n) + O(1)$), and $\delta(D \mid M) \le \delta(D \mid A, d, n) - O(1)$.

Proof: Choose $M' = L_n(A)$. Then, $\log\binom{m}{d} = \log\binom{|L_n(A)|}{d}$ and both $K(M') \le K(A, n) + O(1)$ and $K(M) \le K(A, d, n) + O(1)$. Since also $K(D \mid A, d, n) \le K(D \mid M) + O(1)$, because $A$ may have information about $D$ beyond $M$, we have $\delta(D \mid A, d, n) \ge \delta(D \mid M) + O(1)$. ■

A. MDL Estimation 

To analyze the MDL estimation for DFAs, given a data sample, we first fix details of the code. For the model code, the coding of the DFA, we encode as follows. Let $A = (Q, S, t, q_0, F)$ with $q = |Q|$, $s = |S|$, and $f = |F|$. By renaming of the states we can always take care that $F \subseteq Q$ are the last $f$ states of $Q$. There are $q^{sq}$ different possibilities for $t$, $q$ possibilities for $q_0$, and $q$ possibilities for $f$. Altogether, for every choice of $q, s$ there are $\le q^{qs+2}$ distinct DFAs, some of which may accept the same languages.

Small Model Cost but Difficult to Decode: We can enumerate the DFAs by setting $i := 2, 3, \ldots$, and for every $i$ consider all partitions $i = q + s$ into two positive integer summands, and for every particular choice of $q, s$ considering every choice of final states, transition function, and initial state. This way we obtain a standard enumeration $A_1, A_2, \ldots$ of all DFAs, and, given the index $j$ of a DFA $A_j$ we can retrieve the particular DFA concerned, and for every $n$ we can find $L_n(A_j)$.

Larger Model Cost but Easy to Decode: We encode a DFA $A$ with $q$ states and $s$ symbols self-delimitingly by

• The encoding of the number of symbols $s$ in self-delimiting format in $\lceil \log s \rceil + 2\lceil \log\log s \rceil + 1$ bits;

• The encoding of the number of states $q$ in self-delimiting format in $\lceil \log q \rceil + 2\lceil \log\log q \rceil + 1$ bits;

• The encoding of the set of final states $F$ by indicating that all states numbered $q - f, q - f + 1, \ldots, q$ are final states, by just giving $q - f$ in $\lceil \log q \rceil$ bits;

• The encoding of the initial state $q_0$ by giving its index in the states $1, \ldots, q$, in $\lceil \log q \rceil$ bits; and

• The encoding of the transition function $t$ in lexicographic order of $Q \times S$ in $\lceil \log q \rceil$ bits per transition, which takes $qs\lceil \log q \rceil$ bits altogether.

Altogether, this encodes $A$ in a self-delimiting format in $(qs + 3)\lceil \log q \rceil + 2\lceil \log\log q \rceil + \lceil \log s \rceil + 2\lceil \log\log s \rceil + O(1) \approx (qs + 4)\log q + 2\log s$ bits. Thus, we reckon the model cost of a $(q, s)$-DFA as $m(q, s) = (qs + 4)\log q + 2\log s$ bits. This cost has the advantage that it is easy to decode and that $m(q, s)$ is an easy function of $q, s$. We will assume this model cost.
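
The itemized encoding and its simplified form $m(q, s)$ can be compared numerically. The sketch below is illustrative: the function names are hypothetical, and the treatment of the degenerate cases (taking $\lceil\log\log k\rceil = 0$ when $\log k \le 1$) is an assumption, since the text leaves those cases to the $O(1)$ term.

```python
from math import ceil, log2

def self_delimiting_bits(k: int) -> int:
    """ceil(log k) + 2*ceil(log log k) + 1 bits for a positive integer k."""
    log_k = ceil(log2(k)) if k > 1 else 0
    log_log_k = ceil(log2(log_k)) if log_k > 1 else 0  # assumed 0 in degenerate cases
    return log_k + 2 * log_log_k + 1

def exact_model_bits(q: int, s: int) -> int:
    """Bit count of the itemized encoding of a (q, s)-DFA."""
    state_bits = ceil(log2(q)) if q > 1 else 0
    return (self_delimiting_bits(s)       # number of symbols
            + self_delimiting_bits(q)     # number of states
            + state_bits                  # q - f, marking the final states
            + state_bits                  # index of the initial state
            + q * s * state_bits)         # one target state per (state, symbol)

def model_cost(q: int, s: int) -> float:
    """The simplified model cost m(q, s) = (qs + 4) log q + 2 log s."""
    return (q * s + 4) * log2(q) + 2 * log2(s)
```

For a two-state binary DFA both counts give 10 bits; for larger automata they differ only by lower-order terms, which is why the simpler expression is adopted.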

Data-to-model cost: Given a DFA model $A$, the word length $n$ in $\log n + 2\log\log n$ bits which we simplify to $2\log n$ bits, and the size $d$ of the data sample $D \subseteq \{0,1\}^n$, we can describe $D$ by its index $j$ in the set of $d$ choices out of $l = |L_n(A)|$ items, that is, up to rounding upwards, $\log\binom{l}{d}$ bits. For $0 < d \le l/2$ this can be estimated by $lH(d/l) - \log l/2 + O(1) \le \log\binom{l}{d} \le lH(d/l)$, where $H(p) = p\log 1/p + (1 - p)\log 1/(1 - p)$ ($0 < p < 1$) is Shannon's entropy function. For $d = 1$ or $d = l$ we set the data-to-model cost to $1 + 2\log n$, for $1 < d \le l/2$ we set it to $2\log n + lH(d/l)$ (ignoring the possible saving of a $\log l/2$ term), and for $l/2 < d < l$ we set it to the cost of $d' = l - d$. This reasoning brings us to the following MDL cost of a data sample $D$ for DFA model $A$:

Definition 6: The MDL code length of a data sample $D$ of $d$ strings of length $n$, given $d$, for a DFA model $A$ such that $D \subseteq L_n(A)$, denoting $l = |L_n(A)|$, is given by

$$\mathrm{MDL}(D, A \mid d) = (qs + 4)\log q + 2\log s + 2\log n + lH(d/l).$$

If $d$ is not given we write $\mathrm{MDL}(D, A)$.
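
Definition 6 is straightforward to evaluate. The following sketch (hypothetical function names, logarithms base 2) implements the formula literally and leaves the special cases $d = 1$ and $d = l$ discussed in the text aside.

```python
from math import comb, log2

def entropy(p: float) -> float:
    """Shannon entropy H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def mdl_code_length(q: int, s: int, n: int, d: int, l: int) -> float:
    """MDL(D, A | d) for a (q, s)-DFA A with |L_n(A)| = l and |D| = d."""
    model_bits = (q * s + 4) * log2(q) + 2 * log2(s)
    data_bits = 2 * log2(n) + l * entropy(d / l)
    return model_bits + data_bits

def entropy_gap(d: int, l: int) -> float:
    """How much l*H(d/l) overestimates the exact index length log C(l, d)."""
    return l * entropy(d / l) - log2(comb(l, d))
```

For the two-state parity DFA with $n = 10$, $l = 512$ and $d = 256$ this yields $10 + 2\log 10 + 512 \approx 528.6$ bits, and the entropy estimate exceeds $\log\binom{512}{256}$ by less than 5 bits.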

B. Randomness Deficiency Estimation 

Given data sample $D$ and DFA $A$ with $D \subseteq L_n(A) \subseteq \{0,1\}^n$, we can estimate the randomness deficiency. Again, use $l = |L_n(A)|$ and $d = |D|$. By its definition, the randomness deficiency is

$$\delta(D \mid A, d, n) = \log\binom{l}{d} - K(D \mid A, d, n).$$

Then, substituting the estimate for $\log\binom{l}{d}$ from the previous section, up to logarithmic additive terms,

$$\delta(D \mid A, d, n) = lH(d/l) - K(D \mid A, d, n).$$

Thus, by finding a computable upper bound for $K(D \mid A, d, n)$, we can obtain a computable lower bound on the randomness deficiency $\delta(D \mid A, d, n)$ that expresses the fitness of a DFA model $A$ with respect to data sample $D$.
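
One crude way to obtain such a computable upper bound is to run a general-purpose compressor on a description of $D$ relative to $L_n(A)$. The sketch below is an assumption-laden illustration and not a method from the text: it uses `zlib` on the characteristic vector of $D$ inside the sorted language slice, ignores all $O(1)$ terms, and may return a negative lower bound, which carries no information.

```python
import random
import zlib
from itertools import product
from math import comb, log2

def complexity_upper_bound(sample, universe) -> int:
    """Bits of the zlib-compressed characteristic vector of `sample` within the
    sorted `universe`; a stand-in for an upper bound on K(D | A, d, n)."""
    mask = 0
    for word in sorted(universe):
        mask = (mask << 1) | (word in sample)
    raw = mask.to_bytes((len(universe) + 7) // 8, "big")
    return 8 * len(zlib.compress(raw, 9))

def deficiency_lower_bound(sample, universe) -> float:
    """log C(l, d) minus the compressor-based upper bound on the complexity."""
    index_bits = log2(comb(len(universe), len(sample)))
    return index_bits - complexity_upper_bound(sample, universe)

n = 12
# The slice accepted by a parity DFA: strings with an even number of 1's.
even = sorted("".join(w) for w in product("01", repeat=n) if w.count("1") % 2 == 0)
regular = set(even[: len(even) // 2])                            # first half
scattered = set(random.Random(0).sample(even, len(even) // 2))   # pseudo-random half
```

On these two samples the bound is large for `regular` (the sample is far from typical for the model) and useless for `scattered`, as one would expect when the data look random relative to the model.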

C. Less MDL Code Length Doesn't Mean Better Model

The task of finding the smallest DFA consistent with a set of positive examples is trivial. This is the universal DFA accepting every example (all of $\{0,1\}^n$). Clearly, such a universal DFA will in many cases have a poor generalization error and randomness deficiency. As we have seen, optimal randomness deficiency implies an optimal fitting model to the data sample. It is to be expected that the best fitting model gives the best generalization error in the case that the future data are as typical to this model as the data sample is. We show that the randomness deficiency behaves independently of the MDL code, in the sense that the randomness deficiency can either grow or shrink with a reduction of the length of the MDL code.

We show this by example. Let the set $D$ be a sample set consisting of 50% of all binary strings of length $n$ with an even number of 1's. Note that the number of strings with an even number of 1's equals the number of strings with an odd number of 1's, so $d = |D| = 2^n/4$. Initialize with a DFA $A$ such that $L_n(A) = D$. We can obtain $D$ directly from $A, n$, so we have $K(D \mid A, n) = O(1)$, and since $d = l$ ($l = |L_n(A)|$) we have $\log\binom{l}{d} = 0$, so that altogether $\delta(D \mid A, d, n) = -O(1)$, while $\mathrm{MDL}(D, A) = \mathrm{MDL}(D, A \mid d) + O(1) = (qs + 4)\log q + 2\log s + 2\log n + O(1) = (2q + 4)\log q + 2\log n + O(1)$, since $s = 2$. (The first equality follows since we can obtain $d$ from $n$. We obtain a negative constant randomness deficiency which we take to be as good as 0 randomness deficiency. All arguments hold up to an $O(1)$ additive term anyway.) Without loss of generality we can assume that the MDL algorithm involved works by splitting or merging nodes of the digraphs of the produced sequence of candidate DFAs. But the argument works for every MDL algorithm, whatever technique it uses.

Initialize: Assume that we start our MDL estimation with the trivial DFA $A_0$ that literally encodes all $d$ elements of $D$ as a binary directed tree with $q$ nodes. Then, $2^{n-1} - 1 \le q \le 2^{n+1} - 1$, which yields

$$\mathrm{MDL}(D, A_0) > 2^n n$$
$$\delta(D \mid A_0, d, n) \approx 0.$$

The last approximate equality holds since $d = l$, and hence $\log\binom{l}{d} = 0$ and $K(D \mid A_0, d, n) = O(1)$. Since the randomness deficiency $\delta(D \mid A_0, d, n) \approx 0$, it follows that $A_0$ is a best fitting model for $D$. Indeed, it represents all conceivable properties of $D$ since it literally encodes $D$. However, $A_0$ does not achieve the optimal MDL code.

Better MDL estimation: In a later MDL estimation we improve the MDL code by inferring the parity DFA $A_1$ with two states ($q = 2$) that checks the parity of 1's in a sequence. Then,

$$\mathrm{MDL}(D, A_1) \le 8 + 2\log n + \log\binom{2^{n-1}}{2^{n-2}} \approx 2^{n-1} - \tfrac{1}{2}n$$

$$\delta(D \mid A_1, d, n) = \log\binom{2^{n-1}}{2^{n-2}} - K(D \mid A_1, d, n) \approx 2^{n-1} - \tfrac{1}{2}n - K(D \mid A_1, d, n).$$

We now consider two different instantiations of $D$, denoted as $D_0$ and $D_1$. The first one is regular data, and the second one is random data.

Case 1, regular data: Suppose $D = D_0$ consisting of the lexicographic first 50% of all $n$-bit strings with an even number of occurrences of 1's. Then $K(D_0 \mid A_1, d, n) = O(1)$ and

$$\delta(D_0 \mid A_1, d, n) = 2^{n-1} - O(n).$$

In this case, even though DFA $A_1$ has a much better MDL code than DFA $A_0$ it has nonetheless a much worse fit since its randomness deficiency is far greater.

Case 2, random data: Suppose $D$ is equal to $D_1$, where $D_1$ is a random subset consisting of 50% of the $n$-bit strings with even number of occurrences of 1's. Then, $K(D_1 \mid A_1, d, n) = \log\binom{2^{n-1}}{2^{n-2}} + O(1) \approx 2^{n-1} - \tfrac{1}{2}n$, and

$$\delta(D_1 \mid A_1, d, n) \approx 0.$$

In this case, DFA $A_1$ has a much better MDL code than DFA $A_0$, and it has equally good fit since both randomness deficiencies are about 0.

Remark 11: We conclude that improved MDL estimation of DFAs for multiple data samples doesn't necessarily result in better models, but can do so nonetheless.

Remark 12 (Shortest Model Cost): By Theorem 3
we know that if, in the process of MDL estimation by a
sequence of significantly decreasing MDL codes, a candidate
DFA is represented by its shortest program, then the next
candidate DFA that improves the MDL estimation is
a model of at least as good fit as the preceding one. Let us look
at an example: Suppose we start with DFA $A_2$ that accepts
all strings in $\{0, 1\}^*$. In this case we have $q = 1$ and

$$\mathrm{MDL}(D_0, A_2) = \log \binom{2^n}{2^{n-2}} + O(\log n),$$

$$\delta(D_0 \mid A_2, d, n) = \log \binom{2^n}{2^{n-2}} - O(1).$$

Here $\log \binom{2^n}{2^{n-2}} = 2^n H(\frac{1}{4}) - O(n) \approx \frac{4}{5} \cdot 2^n - O(n)$, since
$H(\frac{1}{4}) \approx \frac{4}{5}$. Suppose the subsequent candidate DFA is the
parity machine $A_1$. Then,

$$\mathrm{MDL}(D_0, A_1) = \log \binom{2^{n-1}}{2^{n-2}} + O(\log n),$$

$$\delta(D_0 \mid A_1, d, n) = \log \binom{2^{n-1}}{2^{n-2}} - O(1),$$

since $K(D_0 \mid A_1, d, n) = O(1)$. Since $\log \binom{2^{n-1}}{2^{n-2}} = 2^{n-1} - O(n)$,
we have $\mathrm{MDL}(D_0, A_1) \approx \frac{5}{8} \mathrm{MDL}(D_0, A_2)$, and
$\delta(D_0 \mid A_1, d, n) \approx \frac{5}{8} \delta(D_0 \mid A_2, d, n)$. Therefore, the
improved MDL cost from model $A_2$ to model $A_1$ is accompanied
by an improved model fitness, since the randomness
deficiency decreases as well. This is forced by Theorem 3,
since both DFA $A_1$ and DFA $A_2$ have $K(A_1), K(A_2) = O(1)$.
That is, the DFAs are represented and penalized according
to their shortest programs (a fortiori of length $O(1)$) and
therefore improved MDL estimation increases the fitness of
the successive DFA models significantly.
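The constants in Remark 12 can be sanity-checked numerically. The sketch below is illustrative only: it assumes $d = 2^{n-2}$, ignores the $O(\log n)$ and $O(1)$ terms, and uses the hypothetical helper names `log2_binom` and `binary_entropy`. It evaluates $H(\frac{1}{4})$ and the ratio of the two data-to-model code lengths.

```python
import math


def log2_binom(n_items: int, k: int) -> float:
    """Base-2 logarithm of the binomial coefficient C(n_items, k), via log-gamma."""
    ln_value = (math.lgamma(n_items + 1)
                - math.lgamma(k + 1)
                - math.lgamma(n_items - k + 1))
    return ln_value / math.log(2)


def binary_entropy(p: float) -> float:
    """Shannon entropy H(p) in bits of a Bernoulli(p) source, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def cost_ratio(n: int) -> float:
    """Leading-term ratio MDL(D_0, A_1) / MDL(D_0, A_2), assuming d = 2^(n-2)."""
    d = 2 ** (n - 2)
    cost_a2 = log2_binom(2 ** n, d)        # A_2 accepts every n-bit string
    cost_a1 = log2_binom(2 ** (n - 1), d)  # A_1 accepts even-parity strings only
    return cost_a1 / cost_a2


if __name__ == "__main__":
    print("H(1/4) =", round(binary_entropy(0.25), 4))
    for n in (8, 12, 16):
        print(n, round(cost_ratio(n), 4))
```

For the values of $n$ tried, the ratio stays close to $\frac{5}{8}$, in line with the estimates $H(\frac{1}{4}) \approx \frac{4}{5}$ and $\mathrm{MDL}(D_0, A_1) \approx \frac{5}{8} \mathrm{MDL}(D_0, A_2)$.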

Appendix 

A. Appendix: Preliminaries 

1) Self-delimiting Code: A binary string $y$ is a proper prefix
of a binary string $x$ if we can write $x = yz$ for $z \neq \epsilon$. A set
$\{x, y, \ldots\} \subseteq \{0, 1\}^*$ is prefix-free if for every pair of distinct
elements in the set neither is a proper prefix of the other. A
prefix-free set is also called a prefix code and its elements are
called code words. As an example of a prefix code, encode
the source word $x = x_1 x_2 \cdots x_n$ by the code word

$$\bar{x} = 1^n 0 x.$$

This prefix-free code is called self-delimiting, because there
is a fixed computer program associated with this code that can
determine where the code word $\bar{x}$ ends by reading it from
left to right without backing up. This way a composite code
message can be parsed into its constituent code words in one
pass by the computer program. Since we use the natural
numbers and the binary strings interchangeably, the notation
$|\bar{x}|$ where $x$ is ostensibly an integer means the length in bits
of the self-delimiting code of the $x$th binary string. On the
other hand, the notation $\overline{|x|}$ where $x$ is ostensibly a binary
string means the self-delimiting code of the length $|x|$ of the
binary string $x$. Using this code we define the standard
self-delimiting code for $x$ to be $x' = \overline{|x|}\,x$. It is easy to check
that $|\bar{x}| = 2n + 1$ and $|x'| = n + 2 \log n + 1$. Let $\langle \cdot \rangle$ denote
a standard invertible effective one-to-one code from $\mathcal{N} \times \mathcal{N}$
to a subset of $\mathcal{N}$. For example, we can set $\langle x, y \rangle = x'y$ or
$\langle x, y \rangle = \bar{x}y$. We can iterate this process to define $\langle x, \langle y, z \rangle \rangle$,
and so on.
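The two codes above can be sketched directly on bit strings. This is a minimal illustration, not the paper's exact construction: it writes the length $|x|$ in ordinary binary rather than via the paper's correspondence between integers and strings, and the function names (`bar`, `standard`, `decode_bar`, `decode_standard`) are hypothetical.

```python
from typing import Tuple


def bar(x: str) -> str:
    """Code word 1^n 0 x for a bit string x of length n (2n + 1 bits in total)."""
    return "1" * len(x) + "0" + x


def standard(x: str) -> str:
    """Code word x' = bar(|x|) x, with |x| written in ordinary binary (an assumption)."""
    length_bits = format(len(x), "b")
    return bar(length_bits) + x


def decode_bar(stream: str, pos: int = 0) -> Tuple[str, int]:
    """Read one bar-coded word starting at pos; return (word, next position)."""
    n = 0
    while stream[pos + n] == "1":  # count the unary length prefix
        n += 1
    start = pos + n + 1            # skip the terminating 0
    return stream[start:start + n], start + n


def decode_standard(stream: str, pos: int = 0) -> Tuple[str, int]:
    """Read one x'-coded word starting at pos; return (word, next position)."""
    length_bits, pos = decode_bar(stream, pos)
    n = int(length_bits, 2)
    return stream[pos:pos + n], pos + n


if __name__ == "__main__":
    # A pair is the concatenation of two self-delimiting code words.
    message = standard("1011") + standard("0")
    first, pos = decode_standard(message)
    second, pos = decode_standard(message, pos)
    print(first, second, pos == len(message))
```

Because every code word announces its own length, the concatenation is parsed left to right in a single pass without separators, which is the property the pairing $\langle x, y \rangle = x'y$ relies on.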

2) Kolmogorov Complexity: For precise definitions, nota- 
tion, and results see the textbook [9]. Informally, the Kol- 
mogorov complexity, or algorithmic entropy, K(x) of a string 
x is the length (number of bits) of a shortest binary program 
(string) to compute x on a fixed reference universal computer
(such as a particular universal Turing machine). Intuitively, 
K(x) represents the minimal amount of information required 
to generate x by any effective process. The conditional Kolmogorov
complexity $K(x \mid y)$ of x relative to y is defined
similarly as the length of a shortest program to compute x, 
if y is furnished as an auxiliary input to the computation. 
For technical reasons we use a variant of complexity, so- 
called prefix complexity, which is associated with Turing 
machines for which the set of programs resulting in a halting 
computation is prefix free. We realize prefix complexity by 
considering a special type of Turing machine with a one-way 
input tape, a separate work tape, and a one-way output tape. 
Such Turing machines are called prefix Turing machines. If a 
machine T halts with output x after having scanned all of p 
on the input tape, but not further, then T(p) = x and we call 
p a program for T. It is easy to see that $\{p : T(p) = x, x \in
\{0, 1\}^*\}$ is a prefix code.

Let $T_1, T_2, \ldots$ be a standard enumeration of all prefix
Turing machines with a binary input tape, for example the
lexicographic length-increasing ordered syntactic prefix Turing
machine descriptions, and let $\phi_1, \phi_2, \ldots$ be the enumeration
of corresponding functions that are computed by the respective
Turing machines ($T_i$ computes $\phi_i$). These functions are
the partial recursive functions or computable functions (of 
effectively prefix-free encoded arguments). The prefix (Kol- 
mogorov) complexity of x is the length of the shortest binary 
program from which x is computed. For the development of 
the theory we require the Turing machines to use auxiliary 
(also called conditional) information, by equipping the ma- 
chine with a special read-only auxiliary tape containing this 
information at the outset. 



One of the main achievements of the theory of computation 
is that the enumeration $T_1, T_2, \ldots$ contains a machine, say $U =
T_u$, that is computationally universal in that it can simulate
the computation of every machine in the enumeration when
provided with its index: $U(\langle y, \langle i, p \rangle \rangle) = T_i(\langle y, p \rangle)$ for all $i, p, y$.
We fix one such machine and designate it as the reference 
universal prefix Turing machine. 

Definition 7: Using this universal machine we define the 
prefix (Kolmogorov) complexity 

$$K(x \mid y) = \min_q \{ |q| : U(\langle y, q \rangle) = x \}, \qquad (7)$$

the conditional version of the prefix Kolmogorov complexity
of $x$ given $y$ (as auxiliary information). The unconditional
version is set to $K(x) = K(x \mid \epsilon)$.
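Since $K(x)$ is not computable, it can only be bounded from above in practice. As a rough illustration outside the formal development, the length of any lossless compressed encoding of $x$, plus the constant size of a fixed decompressor, gives such an upper bound. The sketch below assumes Python's `zlib` as a stand-in compressor; `compressed_bits` is a hypothetical helper, and the numbers it returns say nothing about how close the bound is to $K(x)$.

```python
import os
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of data.

    This is an upper bound on the complexity of data only up to an additive
    constant (the size of a fixed decompressor); it is not K(data) itself.
    """
    return 8 * len(zlib.compress(data, 9))


if __name__ == "__main__":
    regular = b"01" * 4096         # highly regular data of 8192 bytes
    random_like = os.urandom(8192) # data from the OS entropy source

    for name, data in (("regular", regular), ("random-like", random_like)):
        print(name, 8 * len(data), compressed_bits(data))
```

The regular string compresses to a small fraction of its literal length, while the random-like string does not compress at all, mirroring the informal reading of $K(x)$ as the amount of information needed to generate $x$.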

In this paper we use the prefix complexity variant of 
Kolmogorov complexity only for convenience; the plain Kol- 
mogorov complexity without the prefix property would do just 
as well. The functions $K(\cdot)$ and $K(\cdot \mid \cdot)$, though defined in
terms of a particular machine model, are machine-independent 
up to an additive constant and acquire an asymptotically 
universal and absolute character through Church's thesis, that 
is, from the ability of universal machines to simulate one 
another and execute any effective process. The Kolmogorov 
complexity of an individual object was introduced by Kol- 
mogorov [5] as an absolute and objective quantification of 
the amount of information in it. The information theory 
of Shannon [12], on the other hand, deals with average 
information to communicate objects produced by a random 
source. Since the former theory is much more precise, it is
surprising that analogues of theorems in information theory
hold for Kolmogorov complexity, albeit in somewhat weaker
form. An example is the remarkable symmetry of information
property. Let $x^*$ denote the shortest prefix-free program for a
finite string $x$, or, if there is more than one of these, then
$x^*$ is the first one halting in a fixed standard enumeration
of all halting programs. It follows that $K(x) = |x^*|$. Denote
$K(x, y) = K(\langle x, y \rangle)$. Then,

$$\begin{aligned}
K(x, y) &= K(x) + K(y \mid x^*) + O(1) \\
        &= K(y) + K(x \mid y^*) + O(1).
\end{aligned} \qquad (8)$$

3) Precision: It is customary in this area to use "additive
constant $c$" or equivalently "additive $O(1)$ term" to mean a
constant, accounting for the length of a fixed binary program,
independent of every variable or parameter in the expression
in which it occurs.

B. Appendix: Structure Functions and Model Selection 

We summarize a selection of the results in [15]. There, the 
data sample D is a singleton set {x}. The results extend to 
the multiple data sample case in the straightforward way. 

(i) The MDL code length $\lambda_D(\alpha)$ with $D \subseteq \{0, 1\}^n$ and $d =
|D|$ can assume essentially every possible relevant shape $\lambda(\alpha)$
as a function of the maximal model complexity $\alpha$ that is
allowed, up to an additive $O(\log dn)$ term in argument and value.
(Actually, we can take this term as $O(\log n + \log \log \binom{2^n}{d})$, but
since this is cumbersome we use the larger $O(\log dn)$ term.
The difference becomes large for $2^{n-1} < d < 2^n$.) These $\lambda$'s
are all integer-valued nonincreasing functions such that $\lambda$ is
defined on $[0, k]$ where $k = K(D)$, such that $\lambda(0) \le \log \binom{2^n}{d}$
and $\lambda(k) = k$. This is Theorem IV.4 in [15] for singleton data
$x$. There, $\lambda_x$ is contained in a strip of width $O(\log n)$ around
$\lambda$. For multiple data $D$ ($|D| = d$) a similar theorem holds up
to an $O(\log dn)$ additive term in both argument and value, that
is, the strip around $\lambda$ in which $\lambda_D$ is situated now has width
$O(\log dn)$. (The strip idea is made precise in (9) below for
another result.) As a consequence, so-called "nonstochastic"
data $D$ for which $\lambda_D(\alpha)$ stabilizes on $K(D)$ only for large $\alpha$
are common.

(ii) A model achieving the MDL code length $\lambda_D(\alpha)$ essentially
achieves the best possible fit $\beta_D(\alpha)$. This is Theorem
IV.8 in [15] for singleton data and © in this paper for multiple 
data. The precise form is: 

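$$\begin{aligned}
\beta_D(\alpha) + K(D) &\ge \min\{\lambda_D(\alpha') : |\alpha' - \alpha| = O(\log dn)\} - O(\log dn), \\
\beta_D(\alpha) + K(D) &\le \max\{\lambda_D(\alpha') : |\alpha' - \alpha| = O(\log dn)\} + O(\log dn), \\
\lambda_D(\alpha) - K(D) &\ge \min\{\beta_D(\alpha') : |\alpha' - \alpha| = O(\log dn)\} - O(\log dn), \\
\lambda_D(\alpha) - K(D) &\le \max\{\beta_D(\alpha') : |\alpha' - \alpha| = O(\log dn)\} + O(\log dn),
\end{aligned} \qquad (9)$$

with $0 \le \alpha \le K(D)$ and $O(\log dn) \le \alpha' \le K(D)$.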

(iii) As a consequence of (i) and (ii), the best-fit function
$\beta_D$ can assume essentially every possible relevant shape as
a function of the contemplated maximally allowed model
complexity $\alpha$.

From the proof of Item (ii), we see that, given the data
sample $D$, for every finite set $M \supseteq D$ of complexity at
most $\alpha + O(\log dn)$ and minimizing $\Lambda(M)$, we have $\delta(D \mid
M) \le \beta_D(\alpha) + O(\log dn)$. Ignoring $O(\log dn)$ terms, at
every complexity level $\alpha$, every best model at this level
witnessing $\lambda_D(\alpha)$ is also a best one with respect to typicality
(6). This explains why it is worthwhile to find shortest
two-part descriptions $\lambda_D(\alpha)$ for the given data sample $D$: this is
the single known way to find an $M \supseteq D$ with respect to which
$D$ is as typical as possible at model complexity level $\alpha$. Note
that the set $\{(D, M, \beta) \mid D \subseteq M, \delta(D \mid M) < \beta\}$ is not
enumerable, so we are not able to generate such $M$'s directly
[15].

The converse is not true: not every model (a finite set)
witnessing $\beta_D(\alpha)$ also witnesses $\lambda_D(\alpha)$. For example, let
$D = \{x\}$ with $x$ a string of length $n$ with $K(x) \ge n$. Let
$M_1 = \{0, 1\}^n \cup \{y0 \ldots 0\}$ (we ignore the $\{\#1\}$ set giving the
data sample cardinality since $D$ is a singleton set), where $y$
is a string of length $\frac{n}{2}$ such that $K(x, y) \ge \frac{3n}{2}$, and let $M_2 =
\{0, 1\}^n$. Then both $M_1, M_2$ witness $\beta_D(\frac{n}{2} + O(\log n)) = O(1)$,
but $\Lambda(M_1) = \frac{3n}{2} + O(\log n) \gg \lambda_D(\frac{n}{2} + O(\log n)) =
n + O(\log n)$, while $\Lambda(M_2) = n + O(\log n)$.